Abstract
Background
The AudicClaverie method [1] has been and still continues to be a popular approach for detection of differentially expressed genes in the SAGE framework. The method is based on the assumption that under the null hypothesis tag counts of the same gene in two libraries come from the same but unknown Poisson distribution. The problem is that each SAGE library represents only a single measurement. We ask: Given that the tag count samples from SAGE libraries are extremely limited, how useful actually is the AudicClaverie methodology? We rigorously analyze the AC statistic that forms a backbone of the methodology and represents our knowledge of the underlying tag generating process based on one observation.
Results
We show that the AC statistic and the underlying Poisson distribution of the tag counts share the same mode structure. Moreover, the KL divergence from the true unknown Poisson distribution to the AC statistic is minimized when the AC statistic is conditioned on the mode of the Poisson distribution. Most importantly, the expectation of this KL divergence never exceeds 1/2 bit.
Conclusion
A rigorous underpinning of the AudicClaverie methodology has been missing. Our results constitute a rigorous argument supporting the use of AudicClaverie method even though the SAGE libraries represent very sparse samples.
Background
It is of utmost importance for biologists to be able to analyze patterns of expression levels of selected genes in different tissues possibly obtained under different conditions or treatment regimes. Even subtle changes in gene expression levels can be indicators of biologically crucial processes such as cell differentiation and cell specialization [2]. Measurement of gene expression levels can be performed either via hybridization to microarrays, or by counting gene tags (signatures) using e.g. Serial Analysis of Gene Expression (SAGE) [3] or Massively Parallel Signature Sequencing (MPSS) [4] methodologies. The SAGE procedure results in a library of short sequence tags, each representing an expressed gene. The key assumption is that every mRNA copy in the tissue has the same chance of ending up as a tag in the library. Selecting a specific tag from the pool of transcripts can be approximately considered as sampling with replacement. The key step in many SAGE studies is identification of "interesting" genes, typically those that are differentially expressed under different conditions/treatments. This is done by comparing the number of specific tags found in the two SAGE libraries corresponding to different conditions or treatments. Several statistical tests have been suggested for identifying differentially expressed genes through comparing such digital expression profiles, e.g. [1,2,5,6].
Audic and Claverie [1] were among the first to systematically study the influence of random fluctuations and sampling size on the reliability of digital expression profile data. Typically, cDNA libraries contain a large number of different expressed genes and observing a given cDNA qualifies as a rare event [1]. For a transcript representing a small fraction of the library and a large number N of clones, the probability of observing x tags of the same gene will be wellapproximated by the Poisson distribution parametrized by λ ≥ 0.
The unknown parameter λ signifies the number of transcripts of the given type (tag) per N clones in the cDNA library. When comparing two libraries, it is assumed that under the null hypothesis of not differentially expressed genes the tag count x in one library comes from the same underlying Poisson distribution P(·λ) as the tag count y in the other library. However, each SAGE library represents a single measurement only. From a purely statistical standpoint resolving this issue is potentially quite problematic. One can be excused for being rather skeptical about how much can actually be learned about the underlying unknown Poisson distribution from a single observation.
The key instrument of the AudicClaverie approach is a distribution (yx) over tag counts y in one library informed by the tag count x in the other library, under the null hypothesis that the tag counts are generated from the same but unknown Poisson distribution. (yx) is obtained by Bayesian averaging (infinite mixture) of all possible Poisson distributions P(yλ') with mixing proportions equal to the posteriors p(λ'x) under the flat prior over λ. When the two libraries are of the same size, we obtain [1]:
We will refer to (yx) as AudicClaverie statistic (AC statistic) based on counts x and y. Note that (yx) is symmetric, i.e. for x, y ≥ 0, (yx) = (xy). Audic and Claverie [1] point out that this is a desirable property, since if the counts x, y are related to two libraries of the same size, they should be interchangeable when analyzing whether they come from the same underlying process or not. The AC statistic (yx) can be used e.g. for principled inferences, construction of confidence intervals, statistical testing etc. For further details regarding the derivation and mathematical treatment of the AC statistic see [1].
Even though there have been further developments in comparison techniques for cDNA libraries (e.g. while Audic and Claverie [1] only deal with two libraries, Stekel et al. [7] suggest an approach to compare gene expressions across multiple cDNA libraries; for links to further approaches see [2]), the AudicClaverie method has been and still continues to be a popular approach in current biological research, e.g. [817], with 427 citations (based on ISI Web of Knowledge), over 100 citations in the past 3 years. Given the widespread use of the AudicClaverie method, it is somewhat surprising that a rigorous underpinning of the methodology has not yet been fully developed. Audic and Claverie did demonstrate the desirable behavior of their method through Monte Carlo simulations randomly sampling tags based on two experimentally obtained sequence tag distributions [1]. The rate of false alarm, e.g. how often random fluctuations in tag counts are interpreted as significant differences, was small for genes associated with small tag counts and increased for higher tag counts, but never exceeded the significance level of the test. Of course, one may argue that false alarm rate (false positives) is only one side of the story and ideally one would like to minimize both the false positive and false negative rates. The false negative rate quantifies how often significant differences get interpreted as just random fluctuations. However, of equal importance is the issue of why the AudicClaverie approach seems to be wellbehaved, e.g. when compared to an approach based on Ricker's confidence intervals (see [1]). In this contribution, we provide rigorous arguments as to why the AudicClaverie method can be expected to work well, even though from the purely statistical standpoint one could be excused for being skeptical. We start by assuming that for a given gene there is a hidden (unobserved) underlying Poisson distribution generating the tag counts. We then go beyond simple MonteCarlostyle verification by rigorously studying how much and in what form can be actually learned about the distribution in the AudicClaverie framework, given a single observation provided by a SAGE library. In particular, we ask:
1. How natural is the AC statistic's representation of the underlying unknown Poisson distribution governing the tag counts?
2. Given that the observed tag count sample is very limited, how well can the AudicClaverie approach work, i.e. how well does the AC statistic capture the underlying Poisson distribution?
Methods
Basic properties of the AC statistic
In this section we answer the first question posed above. It turns out that the AC statistic and the underlying Poisson distribution are quite similar in their nature: for any (integer) mean tag count λ ≥ 1, the Poisson distribution P(·λ) has two neighboring modes located at λ and λ  1, with P(λλ) = P(λ  1λ). When it comes to the observed tag counts, given a count x ≥ 1, the AC statistic (yx) has two neighboring modes, one located at y = x, the other at y = x  1, with (xx) = (x  1x). As in Poisson distribution, the values of (yx) decrease as one moves away from the modes in both directions.
Theorem 1 Let x, y and d be integers with ranges specified below. It holds:
1. (xx) > (x + dx) for any x ≥ 0 and d ≥ 1.
2. For x ≥ 1, (xx) = (x  1x).
3. (xx) > (x  dx) for any x ≥ 2 and 2 ≤ d ≤ x.
 Proof:
1. We have
In particular,
Hence,
Now, for x ≥ 0, we have
This can be easily seen, as for j ≥ 1, 2(x + j) > 2x + j. It follows that
2. and 3) For d ≤ x,
Hence,
If d = 1,
When 2 ≤ d ≤ x, we have for all j such that 1 ≤ j ≤ d  1,
This follows from 2(x  d + j) <2x  d + j, which can be easily verified, since for j ∈ {1, 2,..., d  1}, we have (j  d) > 2·(j  d).
For j = d, we have the equality (2x  d + j)/(x  d + j) = 2.
Finally, form (4),
Q.E.D
We have shown that after observing a count x, the AC statistic expects counts y = x and y = x  1 with the highest and equal probability. The other values of count y are, as one would naturally expect, less probable.
As an illustrative example we show in figure 1 plots of both the AC statistic (yx) and the corresponding Poisson distribution P(yλ) at λ = x for two values of x, x = 10 and x = 30. As a result of Bayesian averaging in the AC statistic, (yx) is less peaked at its modes than the Poisson counterpart P(yx). However, both (yx) and P(yx) have two modes located at x and x  1.
Figure 1. AC statistic vs. Poisson distribution. Graphs of AC statistic (yx) (solid line) and the corresponding Poisson distribution P(yλ) at λ = x (dashed line) for x = 10 (A) and x = 30 (B).
Information theory of the AC statistic
We now answer, in the framework of information theory, the second question posed in the 'Background' section. Assume that there is some "true" underlying Poisson distribution P(yλ) (1) over possible counts y ≥ 0 with unknown parameter λ. In the same process, we first generate a count x and then use the AC statistic (yx) (3) to define a distribution over y, given the already observed count x. We ask: How different, in terms of KullbackLeibler (KL) divergence, are the two distributions over y? For the AC statistic to work, one would naturally like (yx) to be sufficiently representative of the true unknown distribution P(yλ). In other words, one would expect P(yλ) and (yx) to be close, with the smallest "distance" at (yx = λ) (for λ integer), that is, when count x is exactly equal to the expected tag count under the Poisson distribution P(yλ). In this section we provide a quantitative answer to the above question and show, perhaps surprisingly, that the "statistical distance" between P(yλ) and (yx) is not minimized at x = λ, but it attains minimum at the mode of P(yλ), i.e. when x = λ  1.
First, define the KL divergence from P(yλ) to (yx):
The divergence D(λ, x) has a nice informationtheoretic interpretation: When the log is base 2, D(λ, x) expresses the number of bits of additional information one needs in order to fully specify (yx), provided one has a perfect knowledge of P(yλ). The divergence D(λ, x) is nonnegative, with D(λ, x) = 0 if and only if the two distributions (yx) and P(yλ) coincide.
Naturally,
where
is the entropy of P(yλ) and E_{Q(y)}[f (y)] denotes the expectation of the quantity f (y) under the distribution Q(y).
We have
and so
where for each integer d ≥ 0,
As discussed above, one would intuitively expect D(λ, x) to be minimal for x = λ, as then the conditioning count in the AC statistic would be the mean of the underlying Poisson distribution. However, the mode of that Poisson distribution, λ  1, is surrounded by enough probability mass to yield the following result:
Theorem 2 For any integer λ ≥ 1, it holds D(λ, λ) > D(λ, λ  1). In other words,
 Proof:
Now,
and by Jensen's inequality,
By (8), D(λ, λ)  D(λ, λ  1) = log(2λ) + F(λ, λ  1)  F(λ, λ), and since
we have D(λ, λ)  D(λ, λ  1) > 0, implying D(λ, λ) > D(λ, λ  1).
Q.E.D
We proceed our investigation by asking the following question: Given an underlying Poisson distribution P(xλ), if we repeatedly generated a "representative" count x from P(xλ), what would be the average divergence of the corresponding AC statistic (yx) from the truth P(yλ)? In other words, we are interested in the quantity
Lemma 3 For any λ ≥ 0,
 Proof:
we write
The last equality follows from E_{P(y2λ)}[y] = 2λ and
Let us now evaluate
Using Malmstén's formula again, we obtain
Expansion similar to that in (12) leads to:
Plugging (14) into (13) we obtain (11).
Q.E.D
We will now show that up to terms of order O(λ^{1}), the expected divergence of AC statistic (yx)] from the true underlying Poisson distribution P(yλ) is equal to (1/2) log 2.
Theorem 4 Consider an underlying Poisson distribution P(·λ) parametrized by some λ > 0. Then
 Proof:
and
we have
By lemma 3,
We next approximate the terms F(λ, 0) and F(2λ, 0). To that end, note that the entropy H[P(yλ)] can be approximated as [18]
Hence,
By the same token
Plugging (18) and (19) into (17) we obtain
Q.E.D
In fact, one can obtain a more precise characterization of the expected divergence ε(λ) by using a higher order entropy expansion (for log base 2):
After expressing F(λ, 0) and F(2λ, 0) in the style of (18) and (19), respectively, we obtain an expression for the expected divergence measured in bits:
Figure 2 presents values of the expected divergence ε(λ) (measured in bits) calculated numerically from the definition (9), as well as their analytical approximation calculated from (20). As expected, the two curves are in good correspondence, as our approximation is O(λ^{3}).
Figure 2. Expected KL divergence from the underlying Poisson distribution to AC statistic. Expected KL divergence ε(λ) (measured in bits) from the true unknown Poisson distribution to the AC statistic (solid line) and its analytical approximation (20) (dashed line).
Results of this section suggest that if the true Poisson source P(·λ) is not known, the AC statistic (yx), based on a single observed tag count realization x from P(·λ), is on average not further away from the truth P(yλ) than half a bit of additional information. As the mean tag count λ increases, so does the uncertainty in the generating Poisson distribution P(·λ). As a consequence, the average KL divergence ε(λ) from P(·λ) to the approximating AC statistic (based on a single realization from P(·λ)) gets larger. The average KL divergence expressed in bits increases with increasing λ from about 0.42 bits to 0.5 bits.
Results and Discussion
The AudicClaverie method [1] has been and still continues to be a popular approach for detection of differentially expressed genes in the SAGE framework. The method is based on the assumption that under the null hypothesis the tag counts x, y in two libraries come from the same but unknown Poisson distribution P(·λ). The problem is that each SAGE library represents only a single measurement. We have rigorously analyzed usefulness of the AudicClaverie method by investigating the AC statistic (yx) that forms a backbone of the method and represents our knowledge of the underlying Poisson distribution P(·λ) based on only one tag count x drawn from it.
It turns out that the Poisson distribution is rather "rigid" in the sense that it is unimodal and parametrized by a single parameter λ representing both its mean and variance. Learning about P(·λ) from a very limited sample (as one is effectively bound to do in the SAGE framework) is much less suspicious than one might naively expect.
We have first shown that the AC statistic (yx), even though not a Poisson distribution itself, naturally captures the distribution of further tag counts y, given a single observation x from the unknown P(·λ). According to Theorem 1, for integer λ, both (·x) and P(·λ) have two neighboring modes with decreasing probability values as one moves away from the modes in either direction. In particular, P(·λ) has the modes located at λ and λ  1, with P(λλ) = P(λ  1λ). Given a tag count x ≥ 1, (yx) has the modes located at x and x  1, with (xx) = (x  1x).
We then analyzed how 'close' is the AC statistic (·λ) (in terms of KL divergence) to the underlying Poisson distribution P(·λ) of tag counts. It turns out that the KL divergence from P(yλ) to (yx) is minimized at the mode of P(yλ), i.e. when x = λ  1 (Theorem 2). Most importantly, by Theorem 4, on average, the AC statistic is never too far from the true underlying distribution. To be precise, up to terms of order O(λ^{3}), on average, the AC statistic is never further away from the truth P(·λ) than halfabit of additional information. Hence, the AudicClaverie method can be expected to work well even though the SAGE libraries represent very sparse samples.
So far the AudicClaverie methodology for detection of differentially expressed genes has been verified only empirically through a series of specific Monte Carlo simulations [1]. It has not been clear how general the apparently stable simulation findings were. Besides detailed explanations of the nature of AC statistic capturing the unknown Poisson distribution based on single observation only, we showed that the AC statistic is universally applicable in any situation where inferences about the underlying Poisson distribution must be made based on an extremely sparse sample. Such situations are referred to in machine learning as 'oneshotlearning'. In the Monte Carlo simulations of [1] the false alarm rate was small for genes associated with small tag counts and gradually increased for higher tag counts. The false alarm rate, however, never exceeded the significance level of the test. These findings are consistent with the theoretically calculated divergence function ε(λ) (eq. (20)) illustrated in figure 2. With increasing mean tag count λ, it is more likely that increased counts x will be observed. But as λ increases, so does the uncertainty in the generating Poisson distribution P(·λ). Consequently, the average KL divergence ε(λ) from P(·λ) to the approximating AC statistic (based on a single realization x from P(·λ)) gets larger. For smaller λ the underlying Poisson distribution is well captured by the AC statistic and the test that operates on it will be well behaved. As λ grows, the average KL divergence ε(λ) saturates at 0.5 bits implying that the test based on the AC statistic will continue to be well behaved even for large values of the mean tag count λ.
The AudicClaverie method has also been formulated for the case of two cDNA libraries of unequal size. Similar methodologies have been proposed for the case of multiple cDNA libraries (e.g. [7]). Even though developed under the limited assumption of two libraries of the same size, theoretical results obtained in this paper offer deep insights into the workings of the AudicClaverie approach and provide an information theoretic justification for its use when analyzing expression patterns in cDNA arrays. Of course, when using libraries of unequal size, the AC statistic will no longer be symmetric, putting more weight on the more populated library. Information theoretic investigation of statistics developed for pattern analysis in the cases of unequal multiple libraries is a matter for our future work.
Conclusion
Detection of differentially expressed genes is a crucial step in any large scale automated analysis of patterns of gene expression data. One of the most popular techniques for identifying genes with statistically different expression in SAGE libraries is the methodology of Audic and Claverie [1]. The methodology relies on learning the underlying Poisson distribution of tag counts from a single observation from it in the form of (AC statistic). In this paper we rigorously analyzed the AC statistic. We have shown that under the null hypothesis of not differentially expressed genes:
1. The AC statistic and the underlying Poisson distribution share the same mode structure.
2. The KL divergence from the true unknown Poisson distribution to the AC statistic is minimized when the AC statistic is conditioned on the mode (not mean) of the Poisson distribution.
3. The expected KL divergence from the true unknown Poisson distribution to the AC statistic is never larger than 1/2 bit, irrespective of the mean of the Poisson distribution.
4. The expected KL divergence from the true unknown Poisson distribution to the AC statistic can be approximated up to order O(λ^{3}) by a simple function of the form a_{0 }+ a_{1}λ^{1 }+ a_{2}λ^{2}. For the divergence measured in bits, a_{0 }= 1/2, a_{1 }= 1/24 and a_{2 }= 1/32.
Even though the AC statistic infers the unknown underlying Poisson distribution based on one count observation only, the AudicClaverie method should work reasonably well in most cases, since under the null hypothesis, the average divergence from the unknown Poisson distribution to the AC statistic is guaranteed not to exceed 1/2 bit. This constitutes a rigorous quantitative argument, extending the empirical Monte Carlo studies of [1], that supports the wide spread use of AudicClaverie method, even though by their very nature, the SAGE libraries represent very sparse samples.
Authors' contributions
I am the sole author of this paper.
Acknowledgements
I would like to thank Hong Yan for introducing me to the problem of cDNA array analysis and Somak Raychaudhury for inspiring me to study estimation of Poisson processes based on extremely limited samples.
References

Audic S, Claverie J: The significance of digital expression profiles.
Genome Res 1997, 7:986995. PubMed Abstract  Publisher Full Text

Varuzza L, Gruber A, de B Pereira C: Significance tests for comparing digital gene expression profiles.
Nature Precedings 2008.
npre.2008.2002.3

Velculescu V, Zhang L, Vogelstein B, Kinzler K: Serial analysis of gene expression.
Science 1995, 270:484487. PubMed Abstract  Publisher Full Text

Brenner S, Johnson M, Bridgham J, Golda G, Loyd D, Johnson D, Luo S, McCurdy S, Foy M, Ewan M, et al.: Gene expression analysis by massively parralel signature sequencing on microbead arrays.
Nature Biotechnol 2000, 18:630634. Publisher Full Text

Ruijter J, Kampen AV, Baas F: Statistical evaluation of SAGE libraries: consequences for experimental design.
Physiol Genomics 2002, 11(2):3744. PubMed Abstract

Ge N, Epstein C: An empirical Bayesian significance test of cDNA library data.
Journal of Computational Biology 2004, 11(6):11751188. PubMed Abstract  Publisher Full Text

Stekel D, Git Y, Falciani F: The comparison of gene expressiom from multiple cDNA libraries.
Genome Research 2000, 10:20552061. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bortoluzzi S, Coppe A, Bisognin A, Pizzi C, Danieli G: A multistep bioinformatic approach detects putative regulatory elements in gene promoters.
BMC Bioinformatics 2005, 6:121136. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Medina C, Rotter B, Horres R, Udupa S, Besser B, Bellarmino L, Baum M, Matsumura H, Terauchi R, Kahl G, Winter P: SuperSAGE: the drought stressresponsive transcriptome of chickpea roots.
BMC Genomics 2008, 9:553. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Kim H, Baek K, Lee S, Kim J, Lee B, Cho H, Kim W, Choi D, Hur C: Pepper EST database: comprehensive in silico tool for analyzing the chili pepper (Capsicum annuum) transcriptome.
BMC Plant Biology 2008, 8:101108. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Zhao Y, Li Q, Yao C, Wang Z, Zhou Y, Wang Y, Liu L, Wang Y, Wang L, Qiao Z: Characterization and quantification of mRNA transcripts in ejaculated spermatozoa of fertile men by serial analysis of gene expression.
Human Reproduction 2006, 21(6):15831590. PubMed Abstract  Publisher Full Text

Metta M, Gudavalli R, Gibert J, Schlötterer C: No Accelerated Rate of Protein Evolution in MaleBiased Drosophila pseudoobscura Genes.
Genetics 2006, 174:411420. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Morin R, O'Connor M, Griffith M, Kuchenbauer F, Delaney A, Prabhu A, Zhao Y, McDonald H, Zeng T, Hirst M, Eaves C, Marra M: Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells.
Genome Research 2008, 18:610621. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Borecký J, Nogueira F, de Oliveira K, Maia I, Vercesi A, Arruda P: The plant energydissipating mitochondrial systems: depicting the genomic structure and the expression profiles of the gene families of uncoupling protein and alternative oxidase in monocots and dicots.
Journal of Experimental Botany 2006, 57(4):849864. PubMed Abstract  Publisher Full Text

Lin C, Mueller L, Carthy JM, Crouzillat D, Pétiard V, Tanksley S: Coffee and tomato share common gene repertoires as revealed by deep sequencing of seed and cherry transcripts.
Theor Appl Genet 2005, 112:114130. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cervigni G, Paniego N, Pessino S, Selva J, Diaz M, Spangenberg G, Echenique V: Gene expression in diplosporous and sexual Eragrostis curvula genotypes with differing ploidy levels.

Miles J, Blomberg A, Krisher R, Everts R, Sonstegard T, Tassell CV, Zeulke K: Comparative Transcriptome Analysis of In Vivoand In VitroProduced Porcine Blastocysts by Small Amplified RNASerial Analysis of Gene Expression (SARSAGE).
Molecular Reproduction and Development 2008, 75:976988. PubMed Abstract  Publisher Full Text

Evans R, Boersma J: The Entropy of a Poisson Distribution.
SIAM Review 1988, 30(2):314317. Publisher Full Text