Institute for Theoretical Biology, Humboldt University Berlin, Invalidenstraße 43, Berlin, D-10115, Germany

Institute for Theoretical Biology, Charité Universitätsmedizin, Invalidenstraße 43, Berlin, D-10115, Germany

Institute for Theoretical Chemistry, University of Vienna, Währingerstraße 17, Vienna, A-1090, Austria

Faculty of Biology and Freiburg Initiative in Systems Biology, University of Freiburg, Schänzlestraße 1, Freiburg, D-79104, Germany

Global Change Research Center – CzechGlobe, Academy of Sciences of the Czech Republic, Belidla 986/4a, 60300 Brno, Czech Republic

Abstract

Background

The transcriptomes of several cyanobacterial strains have been shown to exhibit diurnal oscillation patterns reflecting the diurnal phototrophic lifestyle of the organisms. The analysis of such genome-wide transcriptional oscillations is often facilitated by the use of clustering algorithms in conjunction with a number of pre-processing steps. Biological interpretation is usually focused on the time and phase of expression of the resulting groups of genes. However, the use of microarray technology in such studies requires normalization of the data during pre-processing, and the impact of this step on the qualitative and quantitative features of the derived information, such as the number of oscillating transcripts and their respective phases, is unclear.

Results

A microarray based evaluation of diurnal expression in the cyanobacterium _{2} mean ratio transformation. We use the cluster-wise functional enrichment of a clustering derived by LOS normalization, clustering using flowClust, and DFT transformation to derive the diurnal biological program of

Conclusion

Application of quantile normalization, median polishing, and also cyclic LOESS normalization to the presented cyanobacterial dataset leads to increased numbers of oscillating genes and a systematic shift of the expression phase. The LOS normalization minimizes the observed detrimental effects. As previous analyses employed a variety of different normalization methods, a direct comparison of results must be treated with caution.

Background

Photosynthetic organisms such as cyanobacteria have been shown to exhibit complex transcriptional remodeling with respect to diurnal variation of light availability

Technical limitations inherent to the microarray platform cause the resulting data to contain systematic or random technical variation in addition to the biological variation of interest

It is known that the application of any such global normalization method has a significant impact on subsequent analyses, in particular when some of the underlying assumptions on data structure are not or only partially fulfilled

In addition to the normalization steps, microarray analyses require a transformation which accounts for their semi-quantitative nature. Calibration methods for sequence-dependent hybridization energies and unspecific cross-hybridizations have been proposed _{2} mean ratio transformation (in the following:

The biological interpretation of microarray data is possible only after the application of the transformation and normalization. Because microarray data are high-dimensional, clustering is a standard step in their interpretation. A variety of clustering algorithms have been proposed, making it necessary to systematically evaluate their performance on gene expression data

The quantification of diurnal expression in the cyanobacterium

Results and discussion

A diurnal trend in the total chip signal

Cultures of the cyanobacterium


**Oscillation of the unprocessed total signal.** **A**) Prior to any pre-processing, the mean transcript abundance for all genes on the chip (blue dashed) and all 3347 protein-coding genes (black solid) exhibits diurnal oscillations. Significantly oscillating genes (p_{osc}<0.05) resemble the oscillation of the total intensity (black dotted), whereas the non-significantly oscillating genes (p_{osc}>0.05) show increased expression over the day and a peak at 17.5 CT (gray dashed). **B**) The majority of genes exhibit a phase angle **C**) The histogram of Spearman correlation coefficients

To characterize the periodicities present in the unnormalized data set, we calculated the phase of peak transcript levels and amplitudes for all protein-coding transcripts from the DFT component corresponding to the two LD cycles. Since our samples were taken at non-equidistant sampling intervals, the phases do not linearly correspond to the time domain, but accurately reflect the temporal sequence of transcript level peaks. The significance of periodic transcript levels (p_{osc}) was calculated from a permutation-based background model. Strong oscillators (p_{osc}<0.05) reflect the observed global trend, while weak oscillators contain both this global trend and additional peaks at CT17.5, i.e., during the dark phases (Figure

Normalization leads to changed diurnal expression times

We tested the impact of four normalization methods which have either been previously used to analyze the temporal expression organization in cyanobacterial species p_{osc}, as above) in the raw data to define a least-oscillating set (LOS) of reference genes p_{osc}<0.05 retrieved 25% of all transcripts from raw data, 58% from median-polished, 60% from quantile-normalized, 64% from LOS-normalized and 35% from cLOESS-normalized data. At a very conservative cut-off of p_{osc}<0.001, the number of significant oscillators in cLOESS (1.7%) decreased below the level of raw data (raw: 2.2%; quantiles: 4.4%; median polishing: 4.9%; LOS: 7.8%).

While such numbers are interesting to illustrate the extent of transcriptional remodeling, the goal of a microarray analysis is to obtain a temporally resolved picture of the transcriptional landscape. Commonly, the time series is reduced to a phase angle corresponding to the time during the course of a day at which a transcript’s level peaks. Thus, we tested the agreement of phase angles p_{osc}<0.05) from the diagonal can be observed for all but the LOS-normalized data. The deviation follows a strong systematic trend of the weakly or non-oscillatory transcripts (p_{osc}>0.05) towards earlier phases of transcript peaks. LOS-normalization has an opposite effect only on the weak oscillators, and shifts them systematically towards later phases, while strong oscillators remain unaffected. Under the assumption that technical noise is independently and identically distributed amongst the individual samples (microarrays) of a time series, the removal of such noise contributions should not alter the observed phase of a periodic signal or introduce oscillatory behavior. Since quantile normalization, median polishing and cLOESS compensate for the observed global oscillatory trend, an anti-phase oscillation is introduced into weak oscillatory profiles leading to the large number of genes with phases <
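The introduction of anti-phase oscillations can be illustrated with a toy calculation. The following is a minimal sketch, not the analysis performed in this study: three hypothetical strong oscillators and one flat profile are sampled at 12 equidistant points, and each array is centered by subtracting its mean across genes. This centering is only a simplified stand-in for normalization methods that equalize array-wide statistics; all profiles and names are illustrative assumptions.

```python
import math

def first_harmonic(series):
    """Amplitude and peak phase (degrees) of the one-cycle-per-series Fourier component."""
    n = len(series)
    re = sum(x * math.cos(2 * math.pi * i / n) for i, x in enumerate(series))
    im = sum(x * math.sin(2 * math.pi * i / n) for i, x in enumerate(series))
    amplitude = 2 * math.hypot(re, im) / n
    phase = math.degrees(math.atan2(im, re)) % 360
    return amplitude, phase

n = 12
trend = [math.cos(2 * math.pi * i / n) for i in range(n)]  # global trend, peak at index 0
strong = [10 + 3 * t for t in trend]                        # hypothetical strong oscillator
flat = [10.0] * n                                           # hypothetical non-oscillating gene
genes = [strong, strong, strong, flat]                      # rows: genes, columns: arrays

# Toy normalization (assumption): remove the per-array mean across all genes.
array_means = [sum(g[i] for g in genes) / len(genes) for i in range(n)]
centered = [[g[i] - array_means[i] for i in range(n)] for g in genes]

amp_flat_raw, _ = first_harmonic(flat)
amp_flat, phase_flat = first_harmonic(centered[3])
amp_strong, phase_strong = first_harmonic(centered[0])

print("flat gene, raw amplitude:", round(amp_flat_raw, 6))
print("flat gene, centered amplitude and phase:", round(amp_flat, 6), round(phase_flat, 3))
print("strong gene, centered amplitude:", round(amp_strong, 6))
```

In this toy setting the flat profile acquires an oscillation that peaks at 180°, i.e., in anti-phase to the global trend, while the strong oscillators keep their phase at a reduced amplitude.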


**Normalization changes phase angles and expression correlation.** Systematic comparison of important properties of the expression profile set after normalization with different methods. Columns one to four correspond to the methods quantile normalization, median polishing, LOS, and cLOESS, respectively. Rows one to three correspond to plots of prominent average expression profiles, expression phase comparisons, and pairwise correlation distributions. The mean expression profiles for different gene groups illustrate the impact of normalization methods. A comparison of the unnormalized mean expression profile of all genes (dashed blue) with the normalized mean over all genes (black solid), significantly oscillating genes (p_{osc}<0.05 in unnormalized data - black dotted) and non-oscillating genes (p_{osc}>0.05 in unnormalized data - gray dashed) is shown in panels **A** to **D**. The time of maximal expression in oscillatory profiles, measured using the Fourier transformation, is frequently altered by the normalization method. Panels **E** to **H** show the comparison between expression phases observed in the unnormalized (x-axis) versus normalized (y-axis) data. Profiles with significantly oscillating expression (p_{osc}<0.05) are shown in black, whereas weak or non-oscillators are shown in gray (p_{osc}>0.05). The histogram of pairwise Spearman correlation coefficients between expression profiles as a proxy of the diversity of the global expression landscape is shown in panels **I** to **L**.

To better understand the effects of the different normalization methods, we chose another way of characterizing the data, i.e., the pairwise correlation between expression profiles. Before normalization, the distribution of the pairwise Spearman correlation (Figure

It has been noted before that not only the background model, but also the type of data preprocessing can strongly affect the observed periodicity in a microarray dataset

Normalization and transformation shape clustering results

A common way of interpreting microarray expression data is clustering analysis. Clustering of data is often used to identify the temporal or functional organization of regulatory processes occurring, e.g., over one diurnal cycle

This study focuses on a selection of seven popular clustering approaches based on diverse underlying principles which are described in more detail in the methods section. With K-means

**Supporting Information.** A document providing supplementary figures.


The Euclidean distance and the Spearman correlation coefficient were used separately as similarity measures where allowed by the clustering algorithm. The two measures differ fundamentally, since the Euclidean distance captures the absolute difference between each value of two time series whereas the Spearman correlation focuses on the relative differences.
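The difference between the two measures can be made concrete with a small example. The sketch below uses three hypothetical profiles (`low`, `high`, `shifted`, all invented for illustration) and plain-Python helper functions; it is not the implementation used by the clustering packages.

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            result[order[j]] = mean_rank
        pos = end + 1
    return result

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

low = [1.0, 2.0, 4.0, 3.0, 1.5, 0.5]       # hypothetical profile
high = [10 * v for v in low]                # same shape, ten-fold higher signal
shifted = [0.5, 1.0, 2.0, 4.0, 3.0, 1.5]    # same level, peak one sample later

print("Euclidean low-high:   ", euclidean(low, high))
print("Euclidean low-shifted:", euclidean(low, shifted))
print("Spearman  low-high:   ", spearman(low, high))
print("Spearman  low-shifted:", spearman(low, shifted))
```

By Euclidean distance the shifted profile is the closer neighbor of `low`, whereas the Spearman correlation groups `low` with `high`, which has the same shape at a different signal level.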

To explore the large number of clusterings obtained from all combinations of the considered processing steps, the pairwise similarity between clusterings was measured using mutual information (MI, see Methods section for details). These pairwise similarities can be arranged in a matrix where each row and column corresponds to one individual clustering. When rows and columns are ordered identically, this yields a symmetric matrix as shown in Figure


**Clustering results are determined by the normalization.** Pairwise similarity between all clusterings with eight clusters; similarity is measured using mutual information. White encodes minimal similarity, ranging over gray to black for maximal similarity. Rows and columns of the symmetrical matrix are ordered identically according to hierarchical clustering (Hclust, complete link method) of the similarities, represented as a dendrogram on the left. The normalization method applied to the data before clustering is color-coded: no normalization - blue, median polishing - yellow, LOS - green, cLOESS - cyan, quantile normalization - red. The remaining processing steps (clustering algorithm, similarity measure, transformation) are represented as black bars in the corresponding column on the right. The column “correlation” marks the usage of the Spearman correlation coefficient as similarity measure except for clusterings obtained from SOTA, which only allows usage of the Pearson correlation.

We now asked whether the branches of the dendrogram correspond to particular parameters chosen to obtain the corresponding clustering. The specific parameter combination for each row of the similarity matrix is represented as an annotation matrix on the right. This annotation matrix contains a column for every clustering algorithm, transformation, and similarity measure, and black marks indicate usage in the corresponding row’s clustering. The normalization method is color-coded on the left/top of the similarity matrix.

Visual inspection of the normalization method pattern and the annotation matrix reveals six large subgroups A–F (Figure

Subgroups A and B, quantile-normalized and raw data, contain a sub-branch of clusterings that are based on other normalization methods. Inspection of the data transformation methods (Figure

The similarity matrix in Figure

Comparison of the pairwise clustering similarity shows that the normalization method determines the clustering result more than any other step. Furthermore, the difference of the

LOS agrees best with biological knowledge

The implications of the observed normalization effects for the biological data interpretation are demonstrated for selected genes as well as the functional enrichment of a complete clustering result. First, we examined the set of significantly oscillating genes which exhibit large phase shifts after data normalization. As an example, the expression profiles of four such genes are shown in Figure


**Phase changes in high amplitude diurnal expression profiles due to normalization.** The expression profiles of four genes with clear diurnal oscillations before and after normalization with several methods using **A**) and **B**) are functionally associated with the photosynthesis and exhibit induced expression over the day. The expression phase **C**) and **D**) have transposon-related functions and are phase shifted by ≈160° after quantile normalization.

For the gene

As demonstrated, the choice of normalization method can change the qualitative properties of the experimental data. While it is possible that the global oscillatory trend is an experimental artifact and thus should be removed, this removal (e.g., by quantile normalization) leads to the conversion of day-active oscillators into night-active ones. Especially for the two photosynthesis-related genes

Conservative normalization gives biologically reasonable results

Finally, it remains to be shown that the presented data set and the processing provide a biologically reasonable picture. As demonstrated, the LOS normalization shows the least impact on the data and was consequently used in this analysis step. Visual inspection of clustering results revealed very good performance of flowClust with DFT transformation, where the cluster-wise coherence of shape and phase of expression profiles was used as the main criterion. From the range of optimal cluster numbers (8-10) according to the Bayesian information criterion as obtained from flowClust (see Additional file


**Clustering after LOS normalization yields coarse biological program.** The clustering of LOS normalized DFT transformed data using the flowClust approach with ten clusters is shown in panel **A**. The gray lines represent individual gene profiles, the solid colored line marks the cluster mean profile, and the dashed colored lines mark the 5% and 95% quantiles. For visualization the **A**, is presented in panel **B**. The rows of this matrix correspond to biological functions whereas the columns correspond to clusters, where the color marks on the top match the colors used for the cluster mean profiles. The number of genes with the corresponding function is shown on the top of each cell and the enrichment p-value on the bottom. Furthermore, the enrichment p-value is color-coded in the cell background, marking highly significant enrichments in black and non-significant enrichments in white. The rows were rearranged to reveal the temporal ordering.

Most importantly, the three photosynthesis-related clusters 5, 7, and 8 peak as expected in the morning, midday, and evening, respectively. The expression of components of the transcriptional and translational machinery in cluster 1 increases sharply during the DL transition. This could be explained by the extensive metabolic changes due to the transition from respiration to photosynthesis as well as the induction of a variety of processes to utilize the readily available photosynthetic energy. With only a slight delay, the expression of amino acid biosynthesis-related genes increases, possibly to provide the basic elements for protein synthesis. In contrast to protein synthesis, CO_{2} fixation-related genes show an increased expression in the second half of the day (cluster 8). This behavior might reflect a separation between protein synthesis and cellular maintenance during the first half of the day and an accumulation of storage metabolites during the second half as preparation for the night as observed, e.g., in

Conclusions

The expression of a large number of genes oscillates diurnally in a variety of cyanobacterial strains. In the microarray-based evaluation of diurnal patterns in the transcriptome of the cyanobacterium p_{osc}) and clustering analyses to systematically compare the impact of four normalization methods on the presented dataset.

We found that the popular methods median polishing, quantile normalization and cyclic LOESS (cLOESS) normalization systematically change the expression phase of oscillating genes compared to the unnormalized data. This expression phase information is best preserved by the least oscillating set (LOS) normalization, which attributes changes in the least oscillating genes to technical variation and preserves the global oscillatory trend. Analysis of the expression profile correlation shows only minimal impact of the LOS normalization. In contrast, quantile normalization and median polishing strongly alter the original correlation structure by introducing anti-phasic oscillations. Only cLOESS suppresses oscillations without introducing anti-phasic ones. Moreover, the numbers of oscillating genes differ vastly between the different normalization methods. The reason for these normalization side effects is the oscillation in the mean transcript abundance. Only LOS normalization avoids the removal of this global trend and thereby avoids introduction of new anti-phasic oscillations or severe dampening of observed oscillators. On the other hand, LOS normalization may de-emphasize potentially real but weak biological periodicities that are superimposed by the global trend, i.e., transcripts that may specifically peak during the night phase. The mechanism which leads to the oscillation in the mean transcript abundance, despite the consistent application of 1.5_{2} mean ratio transformation, which emphasizes amplitude information more than the standardization and DFT transformation. Since this amplitude information cannot be interpreted in a quantitative manner, it should be removed by standardization and DFT transformation to allow for exclusive clustering by the pattern of change.
Comparison of existing biological knowledge shows that the combination of LOS normalization, clustering using flowClust and DFT transformation, and functional enrichment analysis of the resulting clusters outline the basic diurnal biological program of

While our analysis was focused on a specific dataset obtained for the cyanobacterium

In the light of these analyses, it is possible that the descriptions of large scale oscillatory gene expression and, in particular, expression timings in different cyanobacterial species are biased by the normalization methods employed. To overcome this challenge, more robust multi-chip normalization methods must be considered when studying temporal expression organization. Importantly, the exact source of a diurnal trend in the total chip signal, despite experimental normalization, requires further experimental characterization.

Methods

The Synechocystis sp. PCC 6803 time series expression dataset

^{−2}
^{−1} and a continuous stream of air. The optical density of the culture was monitored by measuring the absorbance at 750 nm. Cultures were synchronized with three cycles of light/dark 12 h:12 h prior to sampling. Aliquots were taken at OD_{750} ≈ 0.5. Over a 24 h time course, 6 samples for RNA isolation were taken at the following time points: 30 minutes before and after the light was switched off (sample 1 - CT 11.5 and sample 2 - CT 12.5), 30 minutes before midnight (sample 3 - CT 17.5), 30 minutes before and after light onset (sample 4 - CT 23.5 and sample 5 - CT 0.5) and 30 minutes before noon (sample 6 - CT 5.5). Cells were filtered rapidly through Supor

Data transformation

The brightness of spots in a microarray experiment, from which the expression strength is derived, depends not only on the number of mRNAs in the sample applied to the array chip. Large differences in hybridization energy and experimental effects like cross-hybridization lead to expression values that span several orders of magnitude and of which only relative changes for one probe set between conditions can be interpreted. It is therefore common to apply transformations that bring raw expression data into the same order of magnitude. To allow for comparability, we also include the raw data in every step of our analysis.

Log2 mean ratio

The

where

Standardization (Z transformation)

The standardization is defined as

where σ_{x} denotes the standard deviation of the gene’s expression profile from its average, which is calculated as

for an expression profile
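As an illustration of how the two transformations treat amplitude, the following minimal sketch applies both to two hypothetical profiles with identical shape but different amplitude. The definitions used here are assumptions made for the example: the log2 mean ratio is taken as the base-2 logarithm of each value divided by the profile mean, and the standardization uses the sample standard deviation with n − 1 in the denominator.

```python
import math

def log2_mean_ratio(profile):
    """log2 of each value relative to the profile mean (assumed definition)."""
    mean = sum(profile) / len(profile)
    return [math.log2(x / mean) for x in profile]

def standardize(profile):
    """Z transformation; the n - 1 denominator is an assumption of this sketch."""
    n = len(profile)
    mean = sum(profile) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in profile) / (n - 1))
    return [(x - mean) / sd for x in profile]

# Two hypothetical profiles: same mean and shape, five-fold different amplitude.
small = [90.0, 100.0, 110.0, 100.0]
large = [50.0, 100.0, 150.0, 100.0]

lr_small, lr_large = log2_mean_ratio(small), log2_mean_ratio(large)
z_small, z_large = standardize(small), standardize(large)

range_small = max(lr_small) - min(lr_small)
range_large = max(lr_large) - min(lr_large)

print("log2 mean ratio ranges:", round(range_small, 3), round(range_large, 3))
print("standardized profiles equal:",
      all(abs(a - b) < 1e-12 for a, b in zip(z_small, z_large)))
```

The log2 mean ratio retains the amplitude difference between the two profiles, whereas after standardization they become indistinguishable, so that only the pattern of change remains.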

Discrete Fourier transformation

A series of measurements {x_{0},...,x_{N−1}}, acquired at times {t_{0},...,t_{N−1}}, can be approximated as a set of sine functions with different frequencies and amplitudes. This transformation into frequency space is done by applying the Discrete Fourier Transform (DFT) to each gene’s time series

where X_{k} represents a sine with period P_{k}=(t_{N−1}−t_{0})/k, and X_{0} represents the non-oscillating component or an offset from 0 of the time series. For each component X_{k}, the amplitude A_{k} and the phase angle φ_{k} can be calculated as A_{k}=|X_{k}|/N and φ_{k}=tan^{−1}(Im(X_{k})/Re(X_{k})). Since the obtained spectrum of a real-valued time series is symmetrical, only the first half of the components is informative. The phase angles φ_{k} provide a distorted measure of the diurnal expression time due to the non-equidistant sampling. However, the phase angles provide an excellent means to obtain a temporal order of oscillating expression patterns.
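The calculation of amplitude and phase angle can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this study: the series is a hypothetical profile with 12 samples covering two cycles, the DFT is computed over sample indices (ignoring the actual sampling times), the amplitude is taken as the modulus divided by the number of samples, and the phase is the argument of the complex component mapped to [0°, 360°).

```python
import cmath
import math

def dft(series):
    """Plain discrete Fourier transform over sample indices."""
    n = len(series)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(series))
            for k in range(n)]

def amplitude_and_phase(component, n):
    """Amplitude |X_k| / N and phase angle atan2(Im, Re) in degrees (assumed conventions)."""
    amplitude = abs(component) / n
    phase = math.degrees(cmath.phase(component)) % 360
    return amplitude, phase

# Hypothetical profile: offset 5, two cycles in 12 samples, peak delayed by 60 degrees.
n = 12
series = [5 + 2 * math.cos(2 * math.pi * 2 * i / n - math.pi / 3) for i in range(n)]

spectrum = dft(series)
amp, phase = amplitude_and_phase(spectrum[2], n)   # k = 2: one cycle per day for two days
offset = spectrum[0].real / n                      # non-oscillating component

print("offset:", round(offset, 6))
print("amplitude and phase at k = 2:", round(amp, 6), round(phase, 3))
print("mirror component has equal modulus:",
      abs(abs(spectrum[2]) - abs(spectrum[n - 2])) < 1e-9)
```

With the sign convention of this sketch, a later peak yields a smaller argument of the component, so that a delay of 60° appears as 300°. Whichever convention is chosen, the ordering of the phase angles reflects the temporal order of the peaks.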

To be able to cluster these frequency spectra, we discard the uninformative non-oscillating component X_{0} and the highest frequency component X_{6} and create a series of values out of the 5 real and imaginary parts of the remaining frequency spectrum for every gene. This component omission can be interpreted as subtracting the mean from each gene’s time series. For the remaining components X_{k}, the amplitude is scaled to emphasize the shape of the expression pattern instead of the absolute amplitude, which is less informative for microarray data. Therefore, the scaled amplitude of X_{k} is the amplitude at component

Detection of periodic expression profiles

As proposed previously, a permutation-based method is used to detect diurnal periodic expression profiles. For a Fourier component X_{k}, its significance can be assessed by the probability p_{osc} to observe X_{k} in a random permutation of the original time series. Therefore, we calculated the Fourier spectra of 100000 random permutations of each time series and calculated the empirical relative probability for each X_{k} to observe a Fourier coefficient equal or larger in a random permutation.
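The permutation procedure can be sketched in a few lines. The function names, the example profiles and the default of 2000 permutations are assumptions chosen to keep the example fast; the analysis described above used 100000 permutations per time series.

```python
import cmath
import math
import random

def component_amplitude(series, k):
    """Modulus of the k-th DFT component, divided by the number of samples."""
    n = len(series)
    return abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                   for i, x in enumerate(series))) / n

def p_osc(series, k, n_perm=2000, seed=0):
    """Fraction of random permutations whose k-th amplitude reaches the observed one."""
    rng = random.Random(seed)
    observed = component_amplitude(series, k)
    shuffled = list(series)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance so that numerically identical amplitudes count as "equal".
        if component_amplitude(shuffled, k) >= observed - 1e-12:
            hits += 1
    return hits / n_perm

n = 12
# Hypothetical profile with two clean cycles in 12 samples.
periodic = [10 + 3 * math.cos(2 * math.pi * 2 * i / n) for i in range(n)]
# Hypothetical profile with a single isolated spike.
spike = [0.0] * n
spike[4] = 1.0

p_periodic = p_osc(periodic, k=2)
p_spike = p_osc(spike, k=2)
print("p_osc periodic:", p_periodic)
print("p_osc spike:   ", p_spike)
```

The isolated spike has the same amplitude at this component under every permutation and therefore cannot become significant, which shows that the background model rewards sinusoidal shape rather than the mere presence of a peak.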

It must be emphasized that the Fourier transform uses a sine function as the underlying model, which in the case of sinusoidal expression profiles leads to a distinct peak in the spectrum. For periodic profiles that deviate from a sinusoidal shape, X_{k} receives a higher probability in the permutation background model.

Data normalization

Strategies for the compensation of experimental variations in multi-chip experiments are generally considered necessary. Such approaches are based on assumptions of similarity between different arrays in the same experiment.

The quantile-normalization approach by Bolstad
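Quantile normalization, as it is commonly described, replaces every value by the mean of all values that share its rank across arrays. The sketch below illustrates this on a small hypothetical matrix (rows are genes, columns are arrays); ties are not treated specially, and the function is an illustration rather than the implementation used in this study.

```python
def quantile_normalize(matrix):
    """Give every array (column) the same value distribution: the rank-wise mean."""
    n_genes, n_arrays = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_arrays)]
    sorted_columns = [sorted(col) for col in columns]
    # Mean across arrays of the r-th smallest value.
    rank_means = [sum(col[r] for col in sorted_columns) / n_arrays
                  for r in range(n_genes)]
    normalized = [[0.0] * n_arrays for _ in range(n_genes)]
    for j, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, gene in enumerate(order):
            normalized[gene][j] = rank_means[rank]
    return normalized

# Hypothetical raw data: 4 genes on 3 arrays with increasing overall signal.
raw = [
    [2.0, 4.0, 6.0],
    [5.0, 9.0, 3.0],
    [3.0, 6.0, 12.0],
    [8.0, 1.0, 9.0],
]
normalized = quantile_normalize(raw)

raw_means = [sum(row[j] for row in raw) / len(raw) for j in range(3)]
norm_means = [sum(row[j] for row in normalized) / len(normalized) for j in range(3)]
sorted_norm_columns = [sorted(row[j] for row in normalized) for j in range(3)]

print("array means before:", raw_means)
print("array means after: ", norm_means)
```

Because all arrays end up with identical distributions, any difference in the array-wide mean, including a global diurnal trend, is removed by construction.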

Median polishing

With the LOESS normalization

In addition, with the least oscillatory set (LOS) normalization we propose a method which is related to the least variant set normalization (LVS)

While LVS attempts to define a set of housekeeping genes by finding profiles with minimal array-to-array variation (after partitioning the observed variation into array-to-array variation, within-probeset variation and residual variation), LOS follows a more intuitive approach. Here, housekeeping genes are defined as the set which exhibits the least pronounced diurnal oscillations (measured by the oscillatory p-value p_{osc}). Defining the lower cutoff p_{osc}>0.7 and considering all transcripts on the chip yields a LOS set of 1173 expression profiles. The global mean expression for each array is shown in Additional file

Clustering algorithms

From the plethora of clustering algorithms, which have been proposed for the clustering of expression data, we chose a diverse set of 7 methods which cover different principles of clustering.

K-means

The non-hierarchical K-means clustering algorithm is implemented in the R-function

taking 1 minus the correlation coefficient.

Partitioning Around Medoids (PAM)

Similar to K-means, PAM is a non-hierarchical clustering algorithm that partitions the data by attempting to minimize the squared error of a distance measure

Hclust

The bottom-up hierarchical cluster

Self-Organizing Maps (SOM)

The non-hierarchical Self Organizing Map (SOM) approach represents multidimensional data in a low-dimensional topological map. The grid used here is one-dimensional and the number of grid points equals the number of clusters

Self-organising tree algorithm (SOTA)

The top-down approach called self-organising tree algorithm or SOTA was proposed as strategy for phylogenetic reconstruction

Mclust

We included a non-hierarchical model-based clustering approach using expectation maximization initialized by hierarchical clustering for parametrized Gaussian mixture models

flowClust

As a second member of the family of model-based clustering methods we chose flowClust

Clustering comparison

Adjusted Rand index

The Rand index

The adjusted Rand index furthermore accounts for similarities in the clusterings which are expected by chance. The adjusted Rand index reaches its maximum of 1 for identical clusterings and has an expected value of 0 for independent random clusterings; it can become negative when the agreement is lower than expected by chance. We use the R-implementation of the adjusted Rand index in function cluster.stats (package: fpc).
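For illustration, the adjusted Rand index in its standard pair-counting form can be computed as in the following sketch. The function and the three example labelings are hypothetical and are not taken from the fpc package; the case in which all objects fall into a single cluster in both clusterings (zero denominator) is not handled.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting adjusted Rand index of two clusterings of the same objects."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    maximum = (sum_a + sum_b) / 2           # agreement of identical clusterings
    return (sum_cells - expected) / (maximum - expected)

reference = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]   # same partition, different cluster names
mixed = [0, 1, 0, 1, 0, 1]       # partition that cuts across the reference

ari_same = adjusted_rand_index(reference, relabeled)
ari_mixed = adjusted_rand_index(reference, mixed)
print("ARI identical partitions:", ari_same)
print("ARI mixed partitions:    ", ari_mixed)
```

The index does not depend on the cluster labels, and the second comparison yields a slightly negative value because the two partitions agree less often than expected by chance.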

Mutual information

The mutual information is defined as

where _{1}(_{2}(

Normalized variation of information

The variation of information was proposed by Meila

where
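Both information-based measures can be computed from the contingency table of two clusterings. The sketch below uses the standard definitions of entropy, mutual information and variation of information (entropy of both clusterings minus twice their mutual information). The normalization by the logarithm of the number of objects is an assumption of this example and may differ from the normalization used in this study; the labelings are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def mutual_information(labels_a, labels_b):
    """Mutual information (natural logarithm) between two clusterings."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    return sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
               for (a, b), c in joint.items())

def variation_of_information(labels_a, labels_b):
    return (entropy(labels_a) + entropy(labels_b)
            - 2 * mutual_information(labels_a, labels_b))

def normalized_vi(labels_a, labels_b):
    """VI divided by log(n), its upper bound for n objects (assumed normalization)."""
    return variation_of_information(labels_a, labels_b) / math.log(len(labels_a))

reference = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]   # same partition, different cluster names
mixed = [0, 1, 0, 1, 0, 1]       # partition that cuts across the reference

mi_same = mutual_information(reference, relabeled)
mi_mixed = mutual_information(reference, mixed)
vi_same = variation_of_information(reference, relabeled)
nvi_mixed = normalized_vi(reference, mixed)
print("MI:", mi_same, mi_mixed)
print("VI identical partitions:", vi_same)
print("normalized VI mixed partitions:", nvi_mixed)
```

Mutual information is a similarity (larger values indicate more similar clusterings), whereas the variation of information is a distance that is zero for identical partitions.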

The construction of a clustering result comparison similar to Figure

**R Script demonstrating application of the considered clustering algorithms.** A document, describing the application of clustering algorithms to time series expression data, using the statistical programming language R.


Functional enrichment analysis

The functional enrichment analysis was performed using the gene annotations as provided by the Cyanobase database

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

IA provided the biological samples. JG performed the microarray measurements. RL implemented the presented analysis, interpreted the data and wrote the manuscript. RM was involved in developing the ideas presented in this paper, the implementation of the analysis and the interpretation of the results. MB was involved in writing the manuscript. RS was involved in developing the ideas presented in this paper and the interpretation of the results. All authors read and approved the final manuscript.

Acknowledgements

The authors are grateful to Anne Rediger, Anika Wiegard, Stefanie Hertel and Christian Beck for collecting culture samples bravely, from dusk till dawn and dawn till dusk. This work has been supported financially through the German Ministry of Education and Research (BMBF), FORSYS partner program (grant number 0315294), the Deutsche Forschungsgemeinschaft (DFG), and the Einstein Foundation Berlin. RL acknowledges funding by the Research Training Group on Computational Systems Biology (GRK1772). RS is financially supported by the project "Local Team and International Consortium for Computational Modelling of a Cyanobacterial Cell", Reg. No. CZ.1.07/2.3.00/20.0256. RL, RS, and IMA are supported by the project "Übergangsmetalle und phototrophes Wachstum: Ein neuer Ansatz der constraint-basierten Modellierung grosser Stoffwechselnetzwerke" funded by the Einstein Foundation Berlin.