Department of Mathematics and Statistics, University of Missouri-Kansas City, 5100 Rockhill Road, Kansas City, MO, USA

Abstract

Background

A frequent application of microarray experiments is monitoring gene activity in a cell during the cell cycle or cell division. A challenge in analyzing such experiments is to identify genes whose periodic expression during the cell cycle is statistically significant. The challenge arises from the large number of genes measured simultaneously, the small to moderate number of measurements per gene taken at different time points, and the high level of non-normal random noise inherent in the data.

Results

Based on two statistical hypothesis testing methods for identifying periodic time series, a novel statistical inference approach, the

Conclusion

The

Background

Microarray experiments are widely used for gene profiling in different cell lines, tissues, and conditions (normal versus cancerous). High-throughput microarray technologies have made it possible to study problems ranging from gene regulation and mRNA stability to pathways for genetic diseases and the discovery of target subpopulations for drug or other therapies. A frequent application of microarray experiments is monitoring gene activity in a cell during the cell cycle or cell division. A challenge for statisticians analyzing such experiments is to identify genes whose periodic expression during the cell cycle is statistically significant. The challenge arises from the large number of genes measured simultaneously, the small to moderate number of measurements per gene taken at different time points, and the high level of non-normal random noise inherent in the data (Wichert

In this paper, another test statistic, the Bartlett's exact

Results

To test the null hypothesis that a signal is normal white noise against the alternative hypothesis that it is periodic (see Methods section), one can form a test statistic from the periodogram of the signal (see Methods section for details) and calculate its p-value. A p-value smaller than a predetermined significance level indicates that the signal is periodic rather than white noise. Fisher

where $x_g$ is the sample realization of the $G$-statistic and $L(x_g)$ is the largest integer less than $1/x_g$.
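The exact p-value attributed to Fisher can be sketched in code. The sketch below assumes the standard form of Fisher's result, $P(G > x) = \sum_{j=1}^{L(x)} (-1)^{j-1} \binom{m}{j} (1 - jx)^{m-1}$, where $m$ is the number of periodogram ordinates and $L(x)$ is the largest integer less than $1/x$. The function name and argument names are illustrative and do not come from the paper.

```python
import math

def fisher_g_pvalue(x, m):
    """Exact P(G > x) under Gaussian white noise (assumed standard Fisher formula).

    x: observed G-statistic (largest periodogram ordinate / sum of ordinates)
    m: number of periodogram ordinates entering the statistic
    """
    if not 0.0 < x <= 1.0:
        raise ValueError("x must lie in (0, 1]")
    # Largest integer strictly less than 1/x, capped at m.
    upper = min(m, math.ceil(1.0 / x) - 1)
    total = 0.0
    for j in range(1, upper + 1):
        total += (-1) ** (j - 1) * math.comb(m, j) * (1.0 - j * x) ** (m - 1)
    # Guard against tiny floating-point excursions outside [0, 1].
    return min(1.0, max(0.0, total))

p_small_m = fisher_g_pvalue(0.75, 2)
p_three = fisher_g_pvalue(0.5, 3)
```

With two ordinates the statistic is the larger of two shares of a uniformly split total, so the p-value at 0.75 can be checked by hand, which makes a convenient sanity test.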

A more general setting of the hypothesis is to test whether a signal is normal white noise or not. Bartlett

where _{g }= _{g}, _{g }is given in equation (10) of the Methods section, [_{g}] = _{g}}, and

Step 1: Calculate

Step 2: Let the ordered

Step 3: For a given FDR level of _{q }be the largest _{q }be the largest

Step 4: The intersection set

A natural question that might come up is: What is the FDR level of the identified periodic genes contained in set K? A straightforward proof leads to the conclusion that the FDR level of the identified periodic genes contained in set
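The four steps above can be sketched as follows. This is a minimal sketch under two assumptions that are not confirmed by the text here: that the FDR control in Step 3 is the Benjamini-Hochberg step-up rule, and that the difference set D consists of the genes selected by the C-statistic but not contained in K (which is consistent with the counts reported in the results table). The function names and the p-values in the example are hypothetical.

```python
def bh_significant(pvalues, q):
    """Indices selected by the Benjamini-Hochberg step-up rule at FDR level q."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose ordered p-value is below rank * q / n.
        if pvalues[idx] <= rank * q / n:
            cutoff = rank
    return set(order[:cutoff])

def cg_procedure(p_c, p_g, q):
    """Return (set selected by C, set selected by G, intersection K, difference D)."""
    sig_c = bh_significant(p_c, q)
    sig_g = bh_significant(p_g, q)
    set_k = sig_c & sig_g   # genes selected by both statistics
    set_d = sig_c - set_k   # assumed definition: selected by C only
    return sig_c, sig_g, set_k, set_d

# Hypothetical p-values for five genes.
p_c = [0.001, 0.004, 0.02, 0.2, 0.6]
p_g = [0.002, 0.5, 0.01, 0.9, 0.7]
sig_c, sig_g, set_k, set_d = cg_procedure(p_c, p_g, q=0.05)
```

In this toy input, genes 0 and 2 are selected by both statistics and land in K, while gene 1 is selected by the C-statistic only and lands in D.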

Analysis of the bacterial cell cycle data

The gene expression data from synchronized bacterium

The gene (ORF00082) in Laub data that is not considered periodic in this paper

Analysis of the yeast cell cycle data

In the second example, the gene expression data sets from the well-known yeast

The same procedure is applied to the cdc15, cdc28, and elution data sets, and the genes identified by both statistics, their intersection set K, and the difference set D are summarized in Table

Number of Significant Periodic Genes Identified by C-statistic, G-statistic, Intersection Set K, and Difference Set D

| Cell type | Experiment | Number of time points | Number of genes | $N_C$ | $N_G$ | $N_K$ | $N_D$ |
|---|---|---|---|---|---|---|---|
| bacteria | | 11 | 1474 | 166 | 44 | 43 | 123 |
| Yeast | alpha | 18 | 4489 | 1188 | 473 | 471 | 717 |
| Yeast | cdc15 | 24 | 4381 | 1636 | 788 | 779 | 857 |
| Yeast | cdc28 | 17 | 1383 | 292 | 27 | 27 | 265 |
| Yeast | Elution | 14 | 5766 | 1056 | 769 | 695 | 361 |
| Human fibroblasts | N2 | 12 | 7077 | 1 | 2 | 1 | 0 |
| Human fibroblasts | N3 | 12 | 7077 | 2 | 0 | 0 | 2 |
| Human HeLa | Score1 | 12 | 15536 | 44 | 7 | 6 | 38 |
| Human HeLa | Score2 | 26 | 16287 | 1351 | 154 | 153 | 1198 |
| Human HeLa | Score3 | 48 | 41508 | 9702 | 6117 | 5770 | 3932 |
| Human HeLa | Score4 | 19 | 40815 | 52 | 52 | 17 | 35 |
| Human HeLa | Score5 | 9 | 35871 | 5 | 1 | 0 | 1 |

$N_C$: number of significant genes picked up by the C-statistic; $N_G$: number of significant genes picked up by the G-statistic; $N_K$: number of significant periodic genes in the intersection set K; $N_D$: number of significant other periodic genes or other patterned genes in the difference set D.

The nine most significant periodic genes in Elution data

The nine most significant genes in set D for Elution data

Analysis of human fibroblasts data

In this example, the microarray data on the transcriptional profiling of the cell cycle in human fibroblasts will be analyzed. The experiments and data sets are reported in Cho _{a}_{a}

The two genes in set D of N3 data

Analysis of human cancer cell line data

In this last example, the human cancer cell line profiling data sets resulted from large-scale microarray experiments given in Whitfield

The six significant periodic genes in set K of Score1 data

Number of Periodic Genes Identified by the Original Experimenters, Wichert et al. (2004), and Chen

| Cell type | Experiment | Experimenter | Wichert | Chen |
|---|---|---|---|---|
| bacteria | | Laub | 44 | 43 |
| Yeast | alpha | Spellman: total of 800 periodic genes identified in all of these four yeast cell cycle experiments | 468 | 471 |
| Yeast | cdc15 | | 766 | 779 |
| Yeast | cdc28 | | 105 | 27 |
| Yeast | Elution | | 193 | 695 |
| Human fibroblasts | N2 | Cho: genes identified in N2 and N3 | 0 | 1 |
| Human fibroblasts | N3 | | 0 | 0 |
| Human HeLa | Score1 | Whitfield: total of 800+ periodic genes identified in these five Human Cancer cell line experiments | 0 | 6 |
| Human HeLa | Score2 | | 134 | 153 |
| Human HeLa | Score3 | | 6043 | 5770 |
| Human HeLa | Score4 | | 56 | 17 |
| Human HeLa | Score5 | | 0 | 0 |

Discussion

Regarding both of the test statistics, several points need to be addressed.

First of all, the

_{g }_{g }_{g }_{g}

Then, the fact that one statistic is greater than its threshold value does not necessarily imply that the other statistic is greater than its threshold value, and vice versa. In other words, from the fact given by (3), it is clear that these two statistics are not equivalent in general; there are times, however, when the two tests agree. This is not surprising because the

Furthermore, the

Moreover, the behavior of the

Empirical power of

| Signal type | | | |
|---|---|---|---|
| sine signal with skewed noise | 81.66% | 75.25% | 75.23% |
| sine signal with normal white noise | 99.09% | 97.57% | 97.57% |

Empirical power of

| The ratio of amplitude of signal to noise | | | |
|---|---|---|---|
| 9:8 | 99.78% | 99.50% | 99.50% |
| 10:8 | 99.97% | 99.93% | 99.93% |
| 11:8 | 99.99% | 99.99% | 99.99% |
| 12:8 | 100% | 100% | 100% |

Empirical power of

| The ratio of amplitude of signal to noise | | | |
|---|---|---|---|
| 7:8 | 96.00% | 91.72% | 91.72% |
| 6:8 | 87.39% | 78.40% | 78.40% |
| 5:8 | 71.95% | 56.87% | 56.82% |
| 4:8 | 51.03% | 34.30% | 34.01% |

Empirical false positive rate of

| Noise type | | | |
|---|---|---|---|
| normal | 12.3% | 6.45% | 4.29% |
| uniform | 13.23% | 7.60% | 4.70% |
| Chi-square | 7.32% | 2.32% | 1.36% |

Finally, because the null distributions of both statistics are exact, the tests work well for any sample size, small or large, as long as the underlying assumptions are met. This makes both tests very valuable for microarray data sets, where the number of observations obtained for each gene is usually not large.
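The exactness claim for the G-statistic can be checked by simulation. The sketch below is illustrative only: it assumes the standard form of Fisher's exact formula and a periodogram evaluated at the Fourier frequencies strictly between 0 and $\pi$, and it tests each simulated series at a fixed significance level rather than under FDR control. It does not reproduce the simulation settings behind the tables above, and all names, the series length, and the seed are hypothetical.

```python
import cmath
import math
import random

def g_statistic(y):
    """Return (largest periodogram ordinate / sum of ordinates, number of ordinates)."""
    n = len(y)
    m = (n - 1) // 2  # Fourier frequencies strictly between 0 and pi
    ordinates = []
    for k in range(1, m + 1):
        w = 2.0 * math.pi * k / n
        s = sum(v * cmath.exp(-1j * w * t) for t, v in enumerate(y, start=1))
        ordinates.append(abs(s) ** 2 / n)
    return max(ordinates) / sum(ordinates), m

def fisher_g_pvalue(x, m):
    """Exact P(G > x) under Gaussian white noise (assumed standard Fisher formula)."""
    upper = min(m, math.ceil(1.0 / x) - 1)
    total = sum((-1) ** (j - 1) * math.comb(m, j) * (1.0 - j * x) ** (m - 1)
                for j in range(1, upper + 1))
    return min(1.0, max(0.0, total))

def false_positive_rate(trials, n, alpha, seed):
    """Fraction of pure Gaussian white-noise series of length n declared periodic."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
        x, m = g_statistic(noise)
        if fisher_g_pvalue(x, m) <= alpha:
            rejections += 1
    return rejections / trials

# Short series, as is typical for microarray time courses.
rate = false_positive_rate(trials=2000, n=15, alpha=0.05, seed=1)
print(f"empirical false positive rate: {rate:.3f}")
```

If the null distribution is exact, the printed rate should stay close to the nominal level even for a series as short as 15 time points.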

Conclusion

In this paper a statistical C&

Methods

Suppose that a time series is observed and one concern is the possible periodicity of this time series. To be specific in the context of gene expressions observed at time _{g}(_{g}(

_{g}_{g}_{gt}

where _{g}(_{g }for gene _{g}(_{g}) = _{g}(_{gt }is a sequence of non-observable random errors with mean 0 and homogenous variance ^{2 }for all

_{g}_{gt}

where _{g}(_{g}(

where _{k }

for

for _{gt}'s are identically independently distributed normal random errors with mean 0 and homogenous variance ^{2 }(that is, _{g}(

_{0}: _{g}_{gt}

versus

_{1}: _{g}_{gt}

then for a fixed gene

For details on the

Other test statistics for searching "hidden periodicity" in a time series have been proposed as part of spectral analysis (Fuller

_{0}: _{g}(

versus

_{0}: _{g}(

for fixed gene

with

for

According to Fisher

where $x_g$ is the sample realization of the $G$-statistic and $L(x_g)$ is the largest integer less than $1/x_g$. Meanwhile, according to Durbin

where _{g }= _{g}, _{g }is given in (10), [_{g}] = _{g}}, and

The C&G Procedure uses both test statistics and provides a practical way to identify significant periodic genes in massive microarray data.
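The periodogram on which both test statistics are built can be sketched as follows. The sketch assumes the common definition $I(\omega_k) = \frac{1}{N}\left|\sum_{t=1}^{N} y_t e^{-i\omega_k t}\right|^2$ at the Fourier frequencies $\omega_k = 2\pi k/N$, $k = 1, \dots, \lfloor (N-1)/2 \rfloor$, and takes the G-statistic as the largest ordinate divided by the sum of all ordinates. The function names and the noise-free example series are hypothetical.

```python
import cmath
import math

def periodogram(y):
    """Periodogram ordinates at w_k = 2*pi*k/N for k = 1..(N-1)//2."""
    n = len(y)
    m = (n - 1) // 2
    ordinates = []
    for k in range(1, m + 1):
        w = 2.0 * math.pi * k / n
        s = sum(v * cmath.exp(-1j * w * t) for t, v in enumerate(y, start=1))
        ordinates.append(abs(s) ** 2 / n)
    return ordinates

def g_statistic(ordinates):
    """Share of the total periodogram mass carried by the largest ordinate."""
    return max(ordinates) / sum(ordinates)

# Hypothetical noise-free profile: two full cycles over N = 12 time points.
n_points = 12
series = [math.cos(2.0 * math.pi * 2 * t / n_points) for t in range(1, n_points + 1)]
ords = periodogram(series)
peak_k = 1 + ords.index(max(ords))  # index of the dominant Fourier frequency
g = g_statistic(ords)
```

For this series essentially all of the periodogram mass sits at the second Fourier frequency, so the G-statistic is close to its maximum of 1. A white-noise series spreads the mass across all ordinates and gives a value nearer 1/m.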

Acknowledgements

This research is supported in part by NSF grant DMS-0426148. Part of this work was done while the author was a visiting scientist at the Stowers Institute for Medical Research (SIMR), on leave from the University of Missouri-Kansas City. The author thanks two anonymous referees whose comments greatly improved the manuscript.