Abstract
Background
Determining whether a gene is differentially expressed in two different samples remains an important statistical problem. Prior work in this area has featured the use of t-tests with pooled estimates of the sample variance based on similarly expressed genes. These methods do not display consistent behavior across the entire range of pooling and can be biased when the prior hyperparameters are specified heuristically.
Results
A two-sample Bayesian t-test is proposed for use in determining whether a gene is differentially expressed in two different samples. The test method is an extension of earlier work that made use of point estimates for the variance. The method proposed here explicitly calculates, in analytic form, the marginal distribution for the difference in the mean expression of two samples, obviating the need for point estimates of the variance without recourse to posterior simulation. The prior distribution involves a single hyperparameter that can be calculated in a statistically rigorous manner, making clear the connection between the prior degrees of freedom and prior variance.
Conclusion
The test is easy to understand and implement, and application to both real and simulated data shows that the method has equal or greater power compared to the previous method and demonstrates consistent Type I error rates. The test is generally applicable outside the microarray field to any situation where prior information about the variance is available, and it is not limited to cases where estimates of the variance are based on many similar observations.
Background
Determining whether a gene is differentially expressed under different conditions is an important statistical problem [1-3]. Consider, for example, a common microarray experiment: for a particular gene Y, one has an unordered set of measurements of log-expression levels in a control situation and in a treatment situation (or any non-control situation) (notation similar to [4]). The question of interest is whether any expression difference is a significant difference, or whether it would be expected under the null hypothesis of no actual difference in expression.
Though multifactorial experimental designs are becoming increasingly popular [5-7], there continue to be experimentalists interested in analyzing two-sample designs. There are many approaches to determining whether genes are differentially expressed in such designs, and there are a number of excellent reviews of the various methods [2,3]. Many of the approaches include the use of a t-test [8-14], which is a common frequentist statistical approach to comparing a difference in sample means. Some have pointed out that the normality assumption used in t-tests may not always hold and other statistical techniques may be warranted [15,16]. Even for those measurements where normality holds, the t-test is difficult to apply because the numbers of replicates n_{1} and n_{2} are often quite small (typically, n_{1} = n_{2} = 1 to 3), leading to uncertainty in the sample variance and relatively low power. An early approach to dealing with this problem was based on the addition of a constant value to the estimate of the standard deviation [16,17]. While the approach does tend to regularize the variance estimates, it is ad hoc and is not expected to exhibit statistically consistent behavior (unknown or incorrect Type I error rate under the null).
More rigorous Bayesian methods have been developed that incorporate prior information about the variance but require Markov chain sampling of the posterior density or numerical integration of the cumulative distribution function [18-21]. A very popular work by Baldi and Long [4] avoids such calculations while using a statistically justified regularization technique via construction of a probabilistic Bayesian framework that applies a prior probability to parameters of interest such as the expression means and variances. An analytically tractable solution can then be obtained by taking a point estimate for the variance. The chief advantages of this Bayesian framework are twofold: it allows the use of a regularized t-test to determine whether a difference in expression is significant, and it provides a natural method for incorporating information from putatively related measurements. It is this basic framework that we seek to extend in the work presented here.
As noted by Baldi and Long [4], the use of a point estimate for the variance is a compromise between a rigorous Bayesian approach and tractable calculation. As demonstrated below, this compromise can lead to bias in the specificity of the test and improper behavior when there are few prior degrees of freedom. Our hope here is to correct this and other issues (both theoretical and practical) towards the goal of improving the performance and understanding of microarray and general laboratory data analysis. To address these issues, we extend the basic framework put forth earlier, taking the model to a fully Bayesian approach by explicitly calculating the marginal posterior distribution for the difference in mean expression levels. This analytically tractable solution does not require simulation from the posterior distribution and obviates the need to use point estimates of the prior variance while overcoming the undesirable correlation between specificity and prior degrees of freedom. Moreover, a clear and statistically rigorous connection between the prior variance and the prior degrees of freedom is established, reducing the number of hyperparameters from two to one while yielding consistent Type I error rates in the process. It should be stressed that achieving consistent Type I error rates is not just a conceptual luxury. Indeed, proper determination of important quantities such as the false discovery rate is of great economic significance and relies on having correct Type I error rates [22].
Method
Bayesian probability model
The original model proposed by Baldi and Long [4] is briefly reviewed here. The likelihood of the observed data, y, given the true expression level, μ, and variance, σ^{2}, for a single gene is assumed to follow a normal distribution:

p(y|μ,σ^{2}) = ∏_{i=1}^{n} N(y_{i};μ,σ^{2}) (1)
where n is the number of measurements in one sample. For the priors on μ and σ^{2} the authors attempt to capture the a priori dependence between μ and σ^{2} by factoring the joint prior as p(μ,σ^{2}) = p(μ|σ^{2})p(σ^{2}) and taking each factor as
p(μ|σ^{2}) = N(μ;μ_{0},σ^{2}/λ_{0}) (2)
and
p(σ^{2}) = I(σ^{2};ν_{0},σ_{0}^{2}) (3)
where the prior probability of μ follows a normal distribution and σ^{2} follows a scaled inverse gamma distribution. The hyperparameters μ_{0} and λ_{0} represent the prior beliefs regarding the value of μ and the confidence associated with that belief, while the hyperparameters σ_{0}^{2} and ν_{0} represent the prior beliefs regarding the value of σ^{2} and the degrees of freedom or confidence associated with that belief. The authors subsequently let λ_{0} → 0 later in their derivation, leading to a diffuse, non-informative prior on μ:

p(μ|σ^{2}) ∝ σ^{-1} (4)
The remaining 1/σ factor that comes from the diffuse normal results in a prior of the form p(μ,σ^{2}) ∝ σ^{-1}I(σ^{2};ν_{0},σ_{0}^{2}). Using Bayes' rule the authors obtain the following posterior distribution:
p(μ,σ^{2}|y,α) ∝ N(μ;μ_{n},σ^{2}/λ_{n})I(σ^{2};ν_{n},σ_{n}^{2}) (5)
where α is a vector of hyperparameters (μ_{0},λ_{0},ν_{0},σ_{0}^{2}) for the prior and

μ_{n} = (λ_{0}μ_{0} + nȳ)/(λ_{0} + n) (6)
λ_{n }= λ_{0 }+ n (7)
ν_{n }= ν_{0 }+ n (8)
The sufficient statistics are the sample mean ȳ and the sample variance s^{2}, with the posterior scale given by

ν_{n}σ_{n}^{2} = ν_{0}σ_{0}^{2} + (n - 1)s^{2} + (λ_{0}n/λ_{n})(ȳ - μ_{0})^{2} (9)

Finally, the authors calculate the posterior mean estimate or point estimate (PE) of the variance for one sample (taking λ_{0} → 0) as

PE(σ^{2}) = ν_{n}σ_{n}^{2}/(ν_{n} - 2) = [ν_{0}σ_{0}^{2} + (n - 1)s^{2}]/(ν_{0} + n - 2) (10)
This estimate is then used in the hypothesis testing procedures implemented in their CyberT software. Though this model is developed using a Bayesian framework to make a point estimate of the variance, the hypothesis testing is carried out using a classical frequentist t-test.
In the standard t-test, the variance is still defined when there are no prior degrees of freedom, but inspection of Eq. (10) shows that when the number of prior degrees of freedom ν_{0} < 2 - n, the variance is undefined. This limiting behavior stems from the use of a point estimate for the variance. The total number of prior degrees of freedom used in the CyberT software is actually 2ν_{0}, as it counts ν_{0} prior degrees of freedom for each sample in the two-sample test. However, for consistency and clarity we hereafter refer to the total number of prior degrees of freedom as ν_{0}. Therefore the total number of degrees of freedom used in the CyberT software for the two-sample hypothesis test appears to be ν_{t-test} = ν_{0} + n_{1} + n_{2} - 4, where n_{1} is the number of experimental replicates of the gene in the control regime and n_{2} is the number of replicates of the gene in the treatment regime. This equation for the total degrees of freedom also demonstrates undesired limiting behavior when ν_{0} = 0, since positive degrees of freedom exist only when n_{1} + n_{2} > 4. This result contradicts what we would expect to see from a classical two-sample t-test with no prior estimate of the variance, i.e. a classical two-sample t-test has n_{1} + n_{2} - 2 degrees of freedom, and positive degrees of freedom exist when n_{1} + n_{2} > 2.
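The limiting behavior of the point estimate can be made concrete with a short sketch. This is an illustrative function, not the CyberT implementation; it assumes the point estimate takes the posterior-mean form [ν_{0}σ_{0}^{2} + (n - 1)s^{2}]/(ν_{0} + n - 2) discussed above, and the names are our own.

```python
def pe_variance(s_sq, n, nu0, s0_sq):
    """Sketch of a point estimate (PE) of the variance for one sample.

    Assumes the posterior-mean form (nu0*s0_sq + (n-1)*s_sq)/(nu0 + n - 2),
    which is undefined (or negative) whenever nu0 + n - 2 <= 0,
    i.e. whenever nu0 <= 2 - n.
    """
    denom = nu0 + n - 2
    if denom <= 0:
        raise ValueError("point estimate undefined: nu0 <= 2 - n")
    return (nu0 * s0_sq + (n - 1) * s_sq) / denom
```

Note that with no prior degrees of freedom (nu0 = 0) and n = 2 replicates the denominator vanishes, whereas a classical variance estimate would still be defined; this is the pathology described above.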
Full two-sample model
Posterior distribution
Instead of using a point estimate for the variance to conduct a frequentist t-test, we can recast the problem to directly infer the posterior distribution for a two-sample case:
p(μ,Δμ,σ^{2}|y) ∝ p(μ,Δμ,σ^{2})p(y|μ,Δμ,σ^{2}) (11)
where μ is the mean expression level of the gene in the control regime, μ + Δμ the mean of the gene in the alternate experimental regime, and σ^{2} the variance associated with the measurements. In contrast to the previous, point estimate model, in this formulation there is no explicit prior dependence between σ^{2} and μ. While this is admittedly a simplification, this independence is not only a practical requirement for the full Bayesian integration; we also believe it is justified on several other grounds. First, while it has been widely observed that genes with different expression levels often have different variances [1], it is not clear how the prior used by Baldi and Long [4] to derive the point estimate of variance could capture the dependencies observed with actual microarray data. Their figures show that the logarithm of the mean expression level tends to decrease with increasing variance, yet the prior given by Eq. (2) asserts that the prior probability of the mean should be more diffuse – not necessarily smaller – with increasing variance. The observed trend between mean and variance cannot be captured by such a prior. Second, their method does include an implicit dependence between mean and variance by using other gene expression levels to determine the priors; indeed, this is what lends the method its power. Because the Bayesian formulation proposed here determines the prior hyperparameters in a similar fashion, it retains this implicit dependence. One may also consider a number of variance-stabilizing transforms that may be applied in order to achieve constant variance across the expression spectrum [23,24]. Finally, as shown below, simulations reveal that the method still retains power even when the true variance is allowed to vary over a substantial range.
Returning to the posterior distribution, we assume that the samples from each experimental regime follow a normal distribution, each with equal variances and (possibly) different means.
y_{i }~ N(μ,σ^{2}) (12)
y_{j }~ N(μ + Δμ,σ^{2}) (13)
This leads to the following posterior distribution:

p(μ,Δμ,σ^{2}|y) ∝ I(σ^{2};ν_{0},σ_{0}^{2}) ∏_{i=1}^{n_{1}} N(y_{i};μ,σ^{2}) ∏_{j=1}^{n_{2}} N(y_{j};μ + Δμ,σ^{2}) (14)
where n_{1} and n_{2} are the number of measurements in each sample. We elect not to make use of the prior on μ used in the Baldi work, Eq. (2), or its limiting form, Eq. (4). Instead, we assume a completely flat prior on both μ and Δμ while retaining the prior for σ^{2} given by Eq. (3). Similar to the earlier work, the observed dependence between μ and σ^{2} is established by calculating σ_{0}^{2} based on the variance of similarly expressed genes. However, in this formulation, instead of taking the average standard deviation of similarly expressed genes, we choose to estimate the prior variance by totaling the sum-squared differences for each similar gene and dividing by the total prior degrees of freedom. This is a more statistically rigorous way of incorporating prior information and leads to a more consistent test, as will be discussed below.
Marginal posterior for Δμ
With the above definitions and assumptions in place, it can be shown (see Appendix) that the marginal posterior distribution follows a t distribution:

p(Δμ|y) = t_{ν_{n}}(Δμ; ȳ_{2} - ȳ_{1}, σ_{n}^{2}(1/n_{1} + 1/n_{2})) (15)

where

ν_{n} = n_{1} + n_{2} + ν_{0} - 2 (16)

ν_{n}σ_{n}^{2} = ν_{0}σ_{0}^{2} + (n_{1} - 1)s_{1}^{2} + (n_{2} - 1)s_{2}^{2} (17)
The sufficient statistics are the sample means ȳ_{1} and ȳ_{2} and the sum-squared differences (n_{1} - 1)s_{1}^{2} and (n_{2} - 1)s_{2}^{2}, where s_{1}^{2} and s_{2}^{2} are the sample variances of the control and treatment measurements. This distribution has the feature that the standard t-test is obtained even when there are no prior degrees of freedom, ν_{0} = 0, in which case the observed variance for each gene dominates the posterior density. A hypothesis test is performed by asserting a null hypothesis that the true difference in expression levels is zero, i.e. Δμ = 0 (though other values for the true difference, Δμ, can be used when attempting to identify genes that are differentially expressed by some threshold degree). When the posterior probability of no differential expression approaches zero, Pr(Δμ = 0|y) → 0, the null is rejected. It is worth noting that while the posterior probability of no differential expression, Pr(Δμ = 0|y), follows the same distribution as the frequentist probability of the data, Pr(y|Δμ = 0) [25], the resulting probabilities have different interpretations. In the Bayesian approach the posterior probability is interpreted as the probability that there is no difference in expression levels. In the frequentist approach the probability is interpreted as the probability of observing such a difference, given chance alone, under the null distribution of no differential expression. When Δμ = 0 the Bayesian and frequentist methods give the same numerical results, and one can ignore the interpretational differences when discussing such things as the false positive rate under the null. However, power analysis calculations (Δμ ≠ 0) are best conducted under a frequentist framework using a noncentral t-distribution [25].
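The resulting hypothesis test is straightforward to implement. The sketch below is illustrative (it is not the authors' R implementation): it pools the prior sum of squares with the observed sums of squares, forms the t statistic with ν_{n} = n_{1} + n_{2} + ν_{0} - 2 degrees of freedom, and returns a two-sided p-value. With ν_{0} = 0 it reduces to the classical two-sample pooled t-test.

```python
import numpy as np
from scipy import stats

def mp_test(y1, y2, nu0=0, s0_sq=0.0):
    """Two-sample test based on the marginal posterior for the
    difference in means (MP).  nu0 and s0_sq are the prior degrees
    of freedom and the prior variance estimated from similar genes."""
    y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
    n1, n2 = len(y1), len(y2)
    nu_n = n1 + n2 + nu0 - 2                      # Eq. (16)
    # Pool the prior sum of squares with the observed sums of squares
    ss = (nu0 * s0_sq
          + ((y1 - y1.mean()) ** 2).sum()
          + ((y2 - y2.mean()) ** 2).sum())
    s_n_sq = ss / nu_n                            # regularized variance
    t = (y2.mean() - y1.mean()) / np.sqrt(s_n_sq * (1.0 / n1 + 1.0 / n2))
    p = 2.0 * stats.t.sf(abs(t), df=nu_n)         # two-sided p-value
    return t, p
```

For example, `mp_test(control, treatment, nu0=0)` reproduces the classical two-sample t-test, while supplying a prior variance adds ν_{0} degrees of freedom to the test.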
Estimating the prior variance
For m similarly expressed genes each having n replicates, the prior degrees of freedom for the variance can be calculated as
ν_{0 }= m(n  1) (18)
The prior variance is then calculated as the total sum-squared differences for each sample of similarly expressed genes divided by the prior degrees of freedom, namely

σ_{0}^{2} = (1/ν_{0}) Σ_{k=1}^{m} Σ_{i=1}^{n} (y_{k,i} - ȳ_{k})^{2} (19)

where ȳ_{k} is the mean response of gene k and y_{k,i} is response i of gene k. In general the prior degrees of freedom and variance should be determined using these equations and not chosen separately from one another. While it is tempting to think of the prior variance as separate from the precision or credibility associated with that estimate (as represented by ν_{0}), a consistent hypothesis test requires that they be considered together based on the actual prior data collected.
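Eqs. (18) and (19) amount to a few lines of code. The sketch below (an illustrative helper, not part of the published software) takes an m × n matrix of replicate measurements for m similarly expressed genes and returns the pair (ν_{0}, σ_{0}^{2}):

```python
import numpy as np

def prior_variance(similar):
    """Estimate (nu0, sigma0_sq) from an (m, n) array holding n replicate
    measurements for each of m similarly expressed genes.
    nu0 = m*(n-1)  [Eq. (18)]; sigma0_sq pools the sum-squared
    differences over all genes and divides by nu0  [Eq. (19)]."""
    similar = np.asarray(similar, float)
    m, n = similar.shape
    nu0 = m * (n - 1)                                         # Eq. (18)
    ss = ((similar - similar.mean(axis=1, keepdims=True)) ** 2).sum()
    return nu0, ss / nu0                                      # Eq. (19)
```

The point of pooling sums of squares (rather than averaging per-gene standard deviations) is that the resulting estimate is unbiased for σ^{2}, which is what makes the prior degrees of freedom and prior variance mutually consistent.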
Results
To assess the performance of the new test, a series of simulations were done to measure the false positive rate and statistical power under various conditions. The simulations consisted of the following steps:
1) Draw random samples of size n_{1} and n_{2} from two normal distributions distributed as N(μ_{1},σ_{test}^{2}) and N(μ_{2},σ_{test}^{2}) respectively. This step represents the selection of two different experimental regimes for a gene of interest in order to test for differential expression. For power testing, the standardized effect (μ_{1} - μ_{2})/σ_{test} = 2 is used to generate the data by setting μ_{1} = 2, μ_{2} = 0, and σ_{test} = 1. For false positive rate testing μ_{1} = 0.
2) Draw an estimate of the prior variance from the following scaled Chi-square distribution: σ_{0}^{2} ~ σ_{test}^{2}χ^{2}(ν_{0})/ν_{0}. This step simulates the task of estimating the prior variance from similarly expressed genes. It is mathematically equivalent to calculating the sum-squared differences of similarly expressed genes and dividing by their prior degrees of freedom, but it is computationally more efficient.
3) Perform hypothesis tests using a) the method used by Baldi and Long [4] based on the mean posterior point estimate for the variance, PE(σ^{2}) or just PE, and b) the new test proposed here based on the marginal posterior for the difference in expression level, MP(Δμ) or just MP. The hypothesis tests using each method are done at the α = 0.05 level using a two-sided, two-sample t-test. Test p-values below this level are counted as significant. To make valid comparisons, both methods used the same number of prior degrees of freedom.
4) Steps 1 to 3 above are repeated 10,000 times.
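The simulation steps above can be sketched as follows for the MP test. This is an illustrative Python version of the procedure (the original study used its own implementation); with the parameters shown it estimates the power at n_{1} = n_{2} = 2, ν_{0} = 2, and a standardized effect of 2.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n1 = n2 = 2                        # replicates per sample
nu0 = 2                            # prior degrees of freedom
alpha = 0.05
mu1, mu2, sigma = 2.0, 0.0, 1.0    # standardized effect of 2 (power run)
n_sims = 10_000

reject = 0
for _ in range(n_sims):
    # Step 1: draw the two samples
    y1 = rng.normal(mu1, sigma, n1)
    y2 = rng.normal(mu2, sigma, n2)
    # Step 2: draw the prior variance estimate (scaled chi-square)
    s0_sq = sigma ** 2 * rng.chisquare(nu0) / nu0
    # Step 3: MP test at level alpha
    nu_n = n1 + n2 + nu0 - 2
    ss = (nu0 * s0_sq + ((y1 - y1.mean()) ** 2).sum()
          + ((y2 - y2.mean()) ** 2).sum())
    t = (y1.mean() - y2.mean()) / np.sqrt(ss / nu_n * (1 / n1 + 1 / n2))
    if 2 * stats.t.sf(abs(t), nu_n) < alpha:
        reject += 1

power = reject / n_sims            # Step 4: fraction of rejections
```

Setting μ_{1} = 0 instead turns the same loop into a false-positive-rate simulation, and swapping in the PE variance in step 3 reproduces the comparison described below.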
Results of the power simulations for a number of sample sizes and prior degrees of freedom are shown in Table 1. Because the PE test cannot be used when ν_{0} + n_{1} + n_{2} ≤ 4, the test has no power and no false positive rate when n_{1} = n_{2} = 2 and ν_{0} = 0. Conversely, the MP test results in a classical t-test in this case and has a power of 0.2169 and a false positive rate of 0.0509, which is near the expected value given the α = 0.05 level of the test. Holding n_{1} = n_{2} = 2 but increasing the number of prior degrees of freedom to ν_{0} = 2 (equivalent to using two other similarly expressed genes, each with two replicates, to determine the prior variance), the PE test can now be used but has a very low power of 0.0450, compared to the MP power of 0.3343. Overall, when the estimate of the variance is balanced between the prior and observed variances (i.e. the prior number of degrees of freedom is not large compared to the number of replicates in the test samples), the MP test is significantly more powerful. The difference in power between the two methods is less pronounced when ν_{0} = 16 (the default value used in the CyberT software when n_{1} = n_{2} = 2). However, one should note that such a relatively high number for the prior degrees of freedom represents a strong degree of belief in the prior estimate of the variance relative to the variance in the genes of interest.
Table 1. Comparison of false positive rate and statistical power for fixed α. For a given significance level (α) the observed false positive rate (FP) and power were determined by simulation using either a point estimate for the variance (PE) or the marginal posterior distribution for Δμ (MP). Values are given for different sample sizes (n_{1 }= n_{2 }= n) and prior degrees of freedom (ν_{0}).
These simulations reveal that the observed false positive rate for the PE test is substantially out of line with the nominal value of 0.05 for small sample replicates (n_{1} = n_{2} ≤ 3) and few prior degrees of freedom (ν_{0} ≤ 16). Conversely, the MP test demonstrates consistent behavior for the observed false positive rate. However, it is possible to obtain the desired false positive rate for the PE method by iterating through values of α. Table 2 shows that, by adjusting the critical α value for the PE method until the observed false positive rate matches that of the MP method, both tests have similar power. The disadvantage is that α no longer represents the expected Type I error rate, and simulation or calibration is required whenever a new test is performed. Conversely, with the MP test, the critical value α represents an unbiased estimate of the Type I error rate.
Table 2. Comparison of α and statistical power for iterated false positive rate. The values are the same as those given in Table 1 except that the significance level (α) was iterated for the PE method until its observed false positive rate matched that of the MP method, where α was held constant.
It is interesting to look for a simple correction to the degrees of freedom in order to make the two methods match each other. Unfortunately, there does not appear to be any simple correction available. The degrees of freedom used in the t-test itself would be easy to modify – though ad hoc, one could change the last term in the expression for the degrees of freedom used in the PE test (ν_{t-test} = ν_{0} + n_{1} + n_{2} - 4) from 4 to 2 by simply adding two to ν_{0}. Unfortunately, the point estimate of the variance given by Eq. (10) is more problematic, as it appears to give an increased estimate of the variance compared to the one used in the MP test, resulting in a decrease in both the t-statistic and the statistical power.
In the CyberT software [4] the prior variance is estimated by averaging the standard deviation over 100 similarly expressed genes. This has two problems associated with it. First, the measured standard deviation is a biased estimate that tends to be smaller than the true value for small sample sizes like n = 2 or 3; the average of these measured standard deviations will therefore also be biased downward, and in general E(s) < σ, where E(·) represents the expected value of a random variable. The correct way to estimate the variance based on prior observations is to calculate the total sum-squared differences for all the prior samples and divide by the total degrees of freedom, as shown in Eq. (19). Second, if similarly expressed genes for each of the two samples are used to estimate the prior variance and they have the same number of replicates, n, as the samples themselves, then estimating the standard deviation from m = 100 similarly expressed genes is tantamount to asserting ν_{0} = 100·(n - 1) prior degrees of freedom. Setting ν_{0} to less than the actual number of prior degrees of freedom used to calculate the prior variance results in a lower actual false positive rate than the rejection level α. For consistency, if we fix the number of prior degrees of freedom that are used in the test, then we must choose a corresponding number of similarly expressed genes that together have the same number of prior degrees of freedom. This dependence between ν_{0} and m is more consistent with the formulation of the prior itself and leads to consistent Type I error rates.
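The downward bias from averaging standard deviations is easy to demonstrate numerically. The sketch below (an illustrative experiment, not code from either software package) simulates many genes with two replicates each at a true σ = 1 and compares the two pooling strategies:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
m, n = 100_000, 2                      # many "genes", two replicates each
x = rng.normal(0.0, sigma, size=(m, n))
s_sq = x.var(axis=1, ddof=1)           # per-gene sample variances

mean_s = np.sqrt(s_sq).mean()          # averaging standard deviations (biased low)
pooled = np.sqrt(s_sq.mean())          # pooling sums of squares, Eq. (19) style
```

For n = 2 and normal data, E(s) = σ√(2/π) ≈ 0.80σ, so the averaged standard deviation underestimates σ by roughly 20%, while the pooled estimate is centered on the true value.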
When we calculate the prior variance in the statistically rigorous manner (by taking the total sum-squared differences of prior observations and dividing by the total prior degrees of freedom), the PE method becomes overly conservative, generating false positives at a rate of 0.0044 at n_{1} = n_{2} = 2, α = 0.05, and ν_{0} = 2. This appears to be the result of using a point estimate for the prior variance that is larger than the expected value given similar genes; see Eq. (10).
As mentioned previously, to perform the full Bayesian integration required by the MP test, the assumption of a constant variance is required. To test the effect of violating this assumption, simulations were performed as described above, but in step 2 the variances for each sample were drawn from a uniform distribution on the range [0.05,1]. The same uniform distribution was used to generate prior variances by calculating the total sum-squared differences of m = ν_{0}/(n - 1) different draws (corresponding to m different genes) for the variance and dividing by the prior degrees of freedom, ν_{0}. The results for both the PE and MP methods are shown in Table 3, where the observed false positive rate for the MP method is used to calibrate the α value for the PE test so as to achieve nearly the same false positive rate for each method. Though both methods are seen to have nearly the same power for a given false positive rate, the PE method again shows large deviations between α and the observed false positive rate. However, the MP method shows more consistency between the observed and theoretical rates and is fairly robust to violations of the constant variance assumption.
Table 3. Comparison of α and statistical power for iterated false positive rate and variable variance. The values are the same as those given in Table 1 and Table 2 except the variance for each sample was drawn from a uniform distribution between 0.05 and 1 during the simulation. The significance level (α) for the PE method was iterated so that the observed false positive rate matched that of the MP method where α was held constant.
To test whether these statistical improvements in the method translate to improvements in experimental data analysis, both methods were evaluated with the same microarray data [26] used by Baldi and Long [27]. The Arfin data matrix is composed of 8 columns by 1973 rows that contain gene expression levels used to study IHF regulated genes in Escherichia coli. Columns 1 to 4 correspond to 4 IHF^{+} strains while columns 5 to 8 correspond to 4 IHF^{-} strains. The 1973 rows correspond to the measured gene expression levels for each strain. There are more rows in the original data set, but to be consistent with the earlier work we chose only rows that had measurable expression levels for all 8 runs. As was done in the Long paper, the 4 control and 4 treatment replicates were partitioned into 12 subsets that differ by at least two replicates so that quasi-independent pairwise comparisons could be made (i.e. 12v56, 12v78, 13v57, 13v68, 14v58, 14v67, 23v58, 23v67, 24v57, 24v68, 34v56, and 34v78). For the experimental evaluation we adopted the same strategy for pooling variances used by Baldi and Long. We sorted all the genes within each sample by mean expression level and computed the prior variance by considering a window of the m/2 most similarly expressed genes around each gene of interest. For the two samples this corresponds to a total of m similarly expressed genes, and Eq. (18) gives the total prior degrees of freedom, ν_{0} = m (since n = 2 here).
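The windowed pooling strategy can be sketched as follows. Both the function name and the handling of genes near the ends of the expression ranking are illustrative choices of ours, not necessarily the exact scheme used in the study; the essential idea is that each gene's prior variance pools the sum-squared differences of its nearest neighbors in mean expression.

```python
import numpy as np

def windowed_prior(expr, m, n):
    """For each gene, pool the sum-squared differences of a window of
    the m/2 most similarly expressed genes (ranked by mean expression)
    within one sample.  expr is a (genes, n) matrix of replicates."""
    expr = np.asarray(expr, float)
    order = np.argsort(expr.mean(axis=1))       # rank genes by mean expression
    half = m // 2                               # m/2 similar genes per sample
    ss = ((expr - expr.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    prior = np.empty(len(expr))
    for rank, g in enumerate(order):
        # clamp the window so it stays inside the ranked list
        lo = max(0, min(rank - half // 2, len(order) - half))
        window = order[lo:lo + half]            # nearest neighbors in expression
        prior[g] = ss[window].sum() / (half * (n - 1))
    return prior
```

Running this on each of the two samples and summing the pooled sums of squares over both gives the σ_{0}^{2} and ν_{0} = m used in the MP test.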
To measure the consistency of a given testing method, the number of common genes discovered between two quasi-independent subsets can be recorded. Genes that are discovered in common between subsets are more likely to be truly differentially expressed; thus a higher commonality percentage (genes discovered in common divided by the total number of genes discovered) is an indication that the testing method is more reliable. The number of common genes discovered can be recorded for the 66 possible ways of comparing the 12 subsets. Based on p-value rankings, the MP method achieved similar results (within sampling error) to those published using the CyberT software. The single hyperparameter (ν_{0}) was optimized for the MP method and a 95% confidence interval of [62.4, 67.8] common genes was identified out of the top 120. This compares well with the published results using CyberT, where optimizing two parameters independently (ν_{0} and m) achieved an average common gene count of 67. These results represent a substantial improvement over the 33 common genes identified using a zero-parameter classical t-test.
Because the p-value rankings by themselves are not actually statistical tests, a more stringent approach to assessing the performance of each method is to identify genes that are differentially expressed at some confidence level α. For the following tests we used a p-value of 0.01 for the null hypothesis cutoff (uncorrected for multiple tests). The results are presented in Table 4. In general, for a given number of prior degrees of freedom, the MP method consistently detects more genes as differentially expressed than does the PE method, and more of these genes are commonly discovered when comparing them with quasi-independent subsets. In addition, the fraction of common genes within those detected is generally higher for the MP method. The difference between the methods tends to decrease as the number of prior degrees of freedom is increased. The optimal performance (in terms of % commonality) is achieved at around ν_{0} = 400, which corresponds to basing the local variance on about 10% of the 1972 similarly expressed genes within each sample (there are 3944 similarly expressed genes when considering both samples). As ν_{0} increases past 400 the commonality percentage drops somewhat. Basing the variance on the combined estimate of all 3944 similar genes results in a difference-in-means test, and the drop in performance is similar to that seen by Long and Baldi when a simple difference in means was used to rank the expression differences [27].
Table 4. Experimental comparison of hypothesis testing methods. The microarray data consist of 1973 genes partitioned into 12 quasi-independent n_{1} = 2 and n_{2} = 2 subsets using replicate experiments of 4 control and 4 treatment runs [26,27]. The number of genes detected as differentially expressed using each method is listed for an α = 0.01 significance level (there was no correction for multiple testing) and different prior degrees of freedom (ν_{0}). The average number of common genes for the 66 unique comparisons of the 12 subsets is also listed, along with the % commonality (common/detected) for each test.
It is important to stress that the above results for the PE method are not what one would expect using the implementation found in the CyberT software [4]. As mentioned before, this is because the existing software incorrectly calculates the pooled variance by averaging the standard deviations of similarly expressed genes. This results in a substantial error; for example, when n_{1} = n_{2} = 2 and ν_{0} = 400, the Type I error rate is approximately 0.125 (2.5 times higher than that predicted by the nominal rate α = 0.05). The authors of CyberT recommend smaller values of ν_{0}, where the overestimate of variance given by their point estimate, Eq. (10), is balanced by the underestimate of variance owing to incorrect pooling, thereby achieving (approximately) correct false positive rates by offsetting two competing biases.
Discussion and conclusion
The new marginal posterior (MP) formulation proposed here is uniformly at least as powerful as the previous point estimate (PE) method on which it is based, and it is more powerful when the number of similarly expressed genes (and therefore prior degrees of freedom ν_{0}) is small. Moreover, unlike the existing regularization methods, the new formulation consistently maintains the proper false positive rates under the null hypothesis across the entire range of pooling. The MP formulation also makes clear the dependence between the prior degrees of freedom and the prior variance, and thus methods to estimate the single hyperparameter ν_{0} are expected to be more robust and interpretable.
As mentioned previously, the key to the power of these Bayesian methods is the use of similarly expressed genes to estimate the expected variance. Naturally, the more similar the variance in expression of the other genes is to the gene of interest, the greater the statistical consistency of the method. In generating the simulated data we have attempted to remain agnostic on the definition of "similarly expressed", using variances that are either constant or drawn from a uniform distribution. Another Bayesian method [28] utilizes a clustering technique based on a series of t-statistics to group genes into prior categories in order to pool their prior variances. Because the Bayesian posterior probabilities used by that method are not interpretable as frequentist Type I error rates, it is difficult to set the desired false positive rate using such a method without resorting to simulation and iteration. Other clustering strategies include those based on nonparametric regression to obtain local variance estimates [29] and mixture models that group together similar variances [30] using maximum likelihood. These clustering strategies assume the variance is known based on the large number of observations in each group. This allows normal distributions to be used in deriving test statistics. Unfortunately, when the number of similarly expressed genes is small, the use of a normal distribution is not justified, as it assumes there is no uncertainty in the variance. The advantage of the MP formulation is that the normal approximation for the test statistic is not required – it is statistically robust across the entire range of pooling or clustering strategies (assuming the strategies work properly to group together genes of similar variance). In our experimental application of the method we have adopted a pooling strategy based on grouping together genes with similar expression levels.
Thus the MP formulation can be used in situations where two or more control replicates can be compared with only single observations under treatment conditions; this is not possible with clustering strategies that require at least two replicates in each sample [30].
Though the MP test was developed in a Bayesian framework, there is a strong analogy with frequentist statistics. One person's prior is another person's data; if we treat the prior estimates of variance as just additional data collection, then our hypothesis test using the t distribution is the same one used in any introductory statistics course. In either situation, one considers the estimate of variance to be the total sum of squared differences for all observations of the variance divided by the total degrees of freedom. By "all observations" we mean the two samples involved in the hypothesis test plus some number of replicates from other genes that we believe to be similarly expressed and good representatives of the variance. This is just like estimating the method standard deviation and conducting a normal t-test with that estimate, along with the additional degrees of freedom that come with it. All of the power analysis calculations can then be done without recourse to simulation; for example, built-in functions in the desktop statistical package R (using the noncentral t distribution) gave the same results we present by simulation for the MP method. It should be stressed that this Bayesian hypothesis test is not limited in any way to the analysis of microarray data. It can be applied to any experimental situation where prior beliefs about the variance associated with small sample sizes are used to achieve higher statistical power while maintaining consistent Type I error rates. This is particularly important when only a few similar observations of the variance are available, for example when only a handful of genes are under study across a range of treatment conditions.
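The pooled-variance view described above can be sketched in a few lines. The function below is a hypothetical illustration (the name `bayes_t_stat` and the interface are our own, not the authors' R implementation): the prior is treated as ν_{0} extra observations of variance σ_{0}^{2}, pooled with the within-sample sums of squares, yielding a t statistic with ν_{n} = n_{1} + n_{2} + ν_{0} − 2 degrees of freedom.

```python
import math

def bayes_t_stat(y1, y2, nu0, sigma0_sq):
    """Two-sample t statistic with the prior variance pooled in as
    nu0 extra degrees of freedom (a sketch of the pooled view)."""
    n1, n2 = len(y1), len(y2)
    mean1 = sum(y1) / n1
    mean2 = sum(y2) / n2
    ss1 = sum((y - mean1) ** 2 for y in y1)  # within-sample sum of squares
    ss2 = sum((y - mean2) ** 2 for y in y2)
    nu_n = n1 + n2 + nu0 - 2                 # total degrees of freedom
    # Pooled variance: the prior contributes "sum of squares" nu0*sigma0_sq.
    s_sq = (nu0 * sigma0_sq + ss1 + ss2) / nu_n
    t = (mean2 - mean1) / math.sqrt(s_sq * (1.0 / n1 + 1.0 / n2))
    return t, nu_n

# Example: three control and three treatment replicates, with a prior
# worth nu0 = 4 observations of variance sigma0_sq = 1.0.
t, df = bayes_t_stat([1.0, 2.0, 3.0], [2.0, 3.0, 4.0], nu0=4, sigma0_sq=1.0)
```

The statistic can then be referred to a t distribution with `df` degrees of freedom (in R, via `pt()`, or the noncentral t for power calculations).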
The method is implemented in R and is freely available for download [31] or by contacting the corresponding author.
Authors' contributions
RJF conceived of and derived the statistical formulation, applied the method to various data sets and drafted the manuscript. MWD contributed major revisions to the manuscript and suggested ways of improving the study. Both authors contributed to the final version of the manuscript and approved it.
Appendix
Determining the marginal posterior distribution of Δμ
The posterior is the product of the prior and the likelihood:
p(μ,Δμ,σ^{2}|y) ∝ p(μ,Δμ,σ^{2})p(y|μ,Δμ,σ^{2}) (20)
The likelihood consists of the product of two normal distributions:
y_{i} ~ N(μ,σ^{2}) (21)
y_{j }~ N(μ + Δμ,σ^{2}) (22)
Combining the joint likelihood into a common exponential gives:
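A sketch of the combined exponential, assuming n_{1} control observations y_{i} and n_{2} treatment observations y_{j} distributed as in Eqs. (21) and (22):

```latex
p(\mathbf{y}\mid\mu,\Delta\mu,\sigma^2)
  \propto (\sigma^2)^{-(n_1+n_2)/2}
  \exp\!\left\{-\frac{1}{2\sigma^2}\left[
      \sum_{i=1}^{n_1}(y_i-\mu)^2
    + \sum_{j=1}^{n_2}(y_j-\mu-\Delta\mu)^2
  \right]\right\}
```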
Next we can define a variable that contains only the terms involving the expression levels:
Expanding the square leads to the following:
Next, we look to complete the square by adding and subtracting the same value k:
Now collapsing back to a single squared term in the exponent and defining a new variable, C_{σ}, we have
C_{μ} = (n_{1} + n_{2})(μ − k)^{2} + C_{σ} (33)
where
The posterior density can then be written as
We can now look to integrate out the true expression level, μ:
We have flat priors on μ and Δμ, allowing us to complete the integration over all μ.
p(μ,Δμ,σ^{2}) ∝ p(σ^{2}) (37)
The posterior for the joint density of Δμ and σ^{2} is then
We next look to integrate out σ^{2} by defining the sufficient statistics associated with the variance:
where
The last term in Eq. (43), C_{Δμ}, now contains all references to the difference in means, Δμ. The term can be expanded by multiplying out k^{2} and then simplified:
This can be inserted back into Eq. (43) to give
The prior for the variance is assumed to follow a scaled inverse chi-square distribution:
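A sketch of the density implied by that prior, with hyperparameters ν_{0} (prior degrees of freedom) and σ_{0}^{2} (prior variance):

```latex
p(\sigma^2) \propto (\sigma^2)^{-(\nu_0/2+1)}
  \exp\!\left(-\frac{\nu_0\,\sigma_0^2}{2\sigma^2}\right)
```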
The posterior density is then
We can now integrate out σ^{2}:
With a change of variables
The posterior density then becomes the well-known t distribution.
where
ν_{n} = n_{1} + n_{2} + ν_{0} − 2 (55)
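As a sketch of the end result (assuming the standard conjugate normal/scaled-inverse-χ² algebra, consistent with Eq. (55) and the pooled-variance description in the Discussion), the marginal posterior for Δμ is a shifted, scaled t distribution:

```latex
\frac{\Delta\mu-(\bar{y}_2-\bar{y}_1)}
     {\sqrt{s_n^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}
  \sim t_{\nu_n},
\qquad
s_n^2=\frac{\nu_0\sigma_0^2+\sum_{i}(y_i-\bar{y}_1)^2+\sum_{j}(y_j-\bar{y}_2)^2}{\nu_n}
```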
Acknowledgements
We would like to thank Magda Bartilson and Stephen DelCardayre for giving us the inspiration to develop the method. We also thank John Grate, Lori Giver and John Whitman for their generous support and feedback and for reviewing various versions of the manuscript.
References
1. Chen Y, Dougherty ER, Bittner ML: Ratio-based decisions and the quantitative analysis of cDNA microarray images. J Biomed Optics 1997, 2(4):364–374.
2. Cui X, Churchill G: Statistical tests for differential expression in cDNA microarray experiments. Genome Biology 2003, 4(4):210.
3. Nadon R, Shoemaker J: Statistical issues with microarrays: processing and analysis. Trends in Genetics 2002, 18(5):265–271.
4. Baldi P, Long A: A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001, 17(6):509–519.
5. Townsend J: Multifactorial experimental design and the transitivity of ratios with spotted DNA microarrays. BMC Genomics 2003, 4(41).
6. Vinciotti V, Khanin R, D'Alimonte R, Liu X, Cattini N, Hotchkiss G, Bucca G, de Jesus O, Rasaiyaah J, Smith CP, Kellam P, Wit E: An experimental evaluation of a loop versus a reference design for two-channel microarrays. Bioinformatics 2005, 21(4):492–501.
7. Yang YH, Speed T: Design issues for cDNA microarray experiments. Nat Rev Genet 2002, 3(8):579–588.
8. Orian A, van Steensel B, Delrow J, Bussemaker HJ, Li L, Sawado T, Williams E, Loo LW, Cowley SM, Yost C, Pierce S, Edgar BA, Parkhurst SM, Eisenman RN: Genomic binding by the Drosophila Myc, Max, Mad/Mnt transcription factor network. Genes Dev 2003, 17(9):1101–1114.
9. Sato N, Sanjuan IM, Heke M, Uchida M, Naef F, Brivanlou AH: Molecular signature of human embryonic stem cells and its comparison with the mouse. Dev Biol 2003, 260(2):404–413.
10. Tompa R, McCallum CM, Delrow J, Henikoff JG, van Steensel B, Henikoff S: Genome-wide profiling of DNA methylation reveals transposon targets of CHROMOMETHYLASE3. Curr Biol 2002, 12(1):65–68.
11. Hu L, Wang J, Baggerly K, Wang H, Fuller GN, Hamilton SR, Coombes KR, Zhang W: Obtaining reliable information from minute amounts of RNA using cDNA microarrays.
12. Gu J, Gu X: Induced gene expression in human brain after the split from chimpanzee. Trends Genet 2003, 19(2):63–65.
13. Pavlidis P, Li Q, Noble WS: The effect of replication on gene expression microarray experiments. Bioinformatics 2003, 19(13):1620–1627.
14. Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A: False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics 2005, 21(13):3017–3024.
15. Thomas JG, Olson JM, Tapscott SJ, Zhao LP: An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Res 2001, 11(7):1227–1236.
16. Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW: On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comp Biol 2001, 8(1):37–52.
17. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 2001, 98(9):5116–5121.
18. Theilhaber J, Bushnell S, Jackson A, Fuchs R: Bayesian estimation of fold-changes in the analysis of gene expression: the PFOLD algorithm. J Comp Biol 2001, 8(6):585–614.
19. Townsend J: Resolution of large and small differences in gene expression using models for the Bayesian analysis of gene expression levels and spotted DNA microarrays. BMC Bioinformatics 2004, 5(54).
20. Townsend JP, Hartl DL: Bayesian analysis of gene expression levels: statistical quantification of relative mRNA level across multiple strains or treatments. Genome Biol 2002, 3(12).
21. Tseng GC, Oh MK, Rohlin L, Liao JC, Wong WH: Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Res 2001, 29(12):2549–2557.
22. Reiner A, Yekutieli D, Benjamini Y: Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 2003, 19(3):368–375.
23. Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002, 18(Suppl 1):S96–S104.
24. Durbin BP, Hardin JS, Hawkins DM, Rocke DM: A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 2002, 18(Suppl 1):S105–S110.
25. DeGroot MH, Schervish MJ: Probability and Statistics. Addison Wesley; 2002.
26. Arfin S, Long D, Ito E, Tolleri L, Riehle M, Paegle E, Hatfield GW: Global gene expression profiling in Escherichia coli K12. J Biol Chem 2000, 275(38):29672–29684.
27. Long D, Mangalam H, Chan B, Tolleri L, Hatfield GW, Baldi P: Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework. J Biol Chem 2001, 276(23):19937–19944.
28. Gottardo R, Pannucci J, Kuske C, Brettin T: Statistical analysis of microarray data: a Bayesian approach. Biostatistics 2003, 4(4):597–620.
29. Jain N, Thatte J, Braciale T, Ley K, O'Connell M, Lee J: Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics 2003, 19(15):1945–1951.
30. Delmar P, Robin S, Daudin JJ: VarMixt: efficient variance modeling for the differential analysis of replicated gene expression data. Bioinformatics 2005, 21(4):502–508.
31. Fox RJ, Dimmic MW: A Bayesian two-sample t-test. [http://www.dimmic.net/supplement/] 2006.