Skip to main content
  • Methodology Article
  • Open access
  • Published:

Confident difference criterion: a new Bayesian differentially expressed gene selection algorithm with applications

Abstract

Background

Recently, the Bayesian method becomes more popular for analyzing high dimensional gene expression data as it allows us to borrow information across different genes and provides powerful estimators for evaluating gene expression levels. It is crucial to develop a simple but efficient gene selection algorithm for detecting differentially expressed (DE) genes based on the Bayesian estimators.

Results

In this paper, by extending the two-criterion idea of Chen et al. (Chen M-H, Ibrahim JG, Chi Y-Y. A new class of mixture models for differential gene expression in DNA microarray data. J Stat Plan Inference. 2008;138:387–404), we propose two new gene selection algorithms for general Bayesian models and name these new methods as the confident difference criterion methods. One is based on the standardized differences between two mean expression values among genes; the other adds the differences between two variances to it. The proposed confident difference criterion methods first evaluate the posterior probability of a gene having different gene expressions between competitive samples and then declare a gene to be DE if the posterior probability is large. The theoretical connection between the proposed first method based on the means and the Bayes factor approach proposed by Yu et al. (Yu F, Chen M-H, Kuo L. Detecting differentially expressed genes using alibrated Bayes factors. Statistica Sinica. 2008;18:783–802) is established under the normal-normal-model with equal variances between two samples. The empirical performance of the proposed methods is examined and compared to those of several existing methods via several simulations. The results from these simulation studies show that the proposed confident difference criterion methods outperform the existing methods when comparing gene expressions across different conditions for both microarray studies and sequence-based high-throughput studies. A real dataset is used to further demonstrate the proposed methodology. In the real data application, the confident difference criterion methods successfully identified more clinically important DE genes than the other methods.

Conclusion

The confident difference criterion method proposed in this paper provides a new efficient approach for both microarray studies and sequence-based high-throughput studies to identify differentially expressed genes.

Background

In the past decade, high-throughput molecular technologies have gained great popularity in gene expression profiling due to their capability of producing thousands of measurements for each of the assayed samples. The microarray technology and next-generation sequencing are two widely used high-throughput technologies. Next-generation sequencing improves upon Sanger dideoxy sequencing so that the number of sequencing reactions in a single run can be in millions. For example, in Nature (2008), Bentley et al. [4] and Wang et al. [34] reported the DNA sequence of a Nigerian individual and an Asian individual, respectively. Ley et al. [18] analyzed the genome sequence of a tumor sample. One common scientific question addressed by these high-throughput experiments is to identify the genes with differential expression between two biological conditions. Although the high-throughput technologies offer us rich biological information, they are highly error-prone because many genes are monitored at the same time with a relatively small sample size. Bayesian methods provide a good solution to this problem because they synthesize all the data by borrowing information across different genes and produce more efficient estimators for evaluating the gene expressions. They include linear models in LIMMA [28] where empirical Bayesian methods were used to obtain stable results even with small sample size. A more detailed description of the Bayesian statistical methods for microarray studies can be found in Dudoit et al. [7], Pan [25], and Kuo et al. [15]. Other Bayesian methods for RNA-Seq studies using next generation sequencing were reviewed by Kvam et al. [16] and Soneson and Delorenzi [29].

Yu et al. [36] pointed out that most statistical methods for microarray studies examined the differential expressions by testing on the equality of means of the log-transformed intensities between the treatment and control, which may not be appropriate for data with complex structures (for example, a mixture normal distributions with multiple modes). They proposed a calibrated Bayes factor (CBF) method to evaluate the ratio of the full data marginal likelihood under the alternative hypothesis that a gene is differentially expressed (DE) relative to that for the null hypothesis that a gene is equivalently expressed (EE) between two biological conditions. Although their approach has the potential for handling data with more complicated distributions, the computational cost of their method may increase greatly with the complexity of the model.

Chen et al. [6] employed a class of mixture models with two components to fit the microarray data with two biological conditions. To evaluate the differential expressions for each gene, they proposed a gene selection algorithm, namely the two-criterion method. Specifically, they calculated a posterior probability that there is at least a two-fold change between the mean values of raw intensities under the two considered conditions. Then a gene is declared to be DE if the resulting posterior probability is large (say at least 0.7). Since the posterior probability is readily available once a Markov chain Monte Carlo sample is drawn from the posterior distribution, the gene selection algorithm proposed by them is quite easy to implement and computationally inexpensive. However, their approach does not consider general data distributions as that in the Bayes factor approach given by Yu et al. [36]. Assuming that the data under each biological condition follow a log-normal distribution as in [6], the mean value of raw intensities equals to exp(mean+variance/2) under each condition. Thus, the two-criterion method proposed by Chen et al. [6] that calculates the ratio of two means of the raw intensities depends on not only the difference between the two transformed means but also the difference between their variances. So, when the differences between the means and the differences between the variances are in opposite directions, the Chen et al. method may not be able to detect DE genes. Additionally, their paper neither provides a guidance on controlling the false discovery rate (FDR) nor carries out the performance comparison with other existing methods.

Our goal in this paper is to develop a simple but efficient gene selection algorithm so that it is not only computationally efficient, but also flexible in handling data with a complicated distribution as in Yu et al. [36]. We redevelop the two-criterion method proposed by Chen et al. [6] and construct two new gene selection algorithms for general Bayesian models. One is based on the differences between means and the other is based on both mean differences and variance differences. To differentiate the method proposed in Chen et al. [6], we name our methods as confident difference criterion methods and the two proposed confident difference criterion methods in this paper as Methods I and II. We show that the Method I, which compares the mean expressions from different conditions, is equivalent to the calibrated Bayes factor approach [36] when the raw intensities from two different biological conditions follow log-normal distributions with equal variance. We also address the multiple comparisons issue with a control of the false discovery rate. We further apply the proposed method to carry out analyses of microarray data with more than two conditions as well as sequence-based RNA data.

Method

Model for microarray data

We assume that the data, denoted by D obs , have already been preprocessed with appropriate transformation and normalization. Let T be the total number of biological conditions in the study. The data may contain two biological conditions (T=2) or multiple biological conditions (T>2). The common analytical objective is to detect differentially expressed (DE) genes across different biological conditions.

Let x gtk denote the preprocessed expression intensity of the g th gene in the k th sample under the t th condition for t=1,⋯,T. There are a total of G genes with sample size n gt under condition t. Thus, the data on gene g under each condition can be summarized using a vector: \(\mathbf {X}_{\textit {gt}}=(x_{gt1},\ldots,x_{gtn_{\textit {gt}}})\phantom {\dot {i}\!}\). We assume that the intensity, x gtk , k=1,⋯,n gt , t=1,2,⋯,T, follows a normal distribution \(\mathcal {N}\left (\mu _{\textit {gt}}, \sigma ^{2}_{\textit {gt}}\right)\) independently. The parameters μ gt and \(\sigma ^{2}_{\textit {gt}}\) denote the mean and variance of the intensities of gene g under condition t, respectively. We write the mean intensities as μ gt = μ g + δ gt /T, t = 1,…,T, where \(\sum ^{T}_{t=1}\delta _{\textit {gt}}\,=\,0\). For simplicity, we set \(\delta _{g1}=-\sum ^{T}_{t=2} \delta _{\textit {gt}}\) under the first biological condition. We note that μ g defines the overall mean of the intensities across all biological conditions, and δ gt /T measures the difference in the mean intensity under biological condition t from the overall mean. In a microarray study with two biological conditions (T=2), the mean intensities μ g1 and μ g2 are written as μ g1=μ g −δ g /2 and μ g2=μ g +δ g /2, respectively. When a gene is DE, we expect that the distributions of the data differ at least under two biological conditions.

Hierarchial prior distributions

Noninformative conditionally conjugate priors are specified for all parameters. Specifically, we assume that the mean parameters \(\mu _{g} \sim \mathcal {N}(0,\tau ^{2})\) and \(\delta _{\textit {gt}} \sim \mathcal {N} (0, \omega ^{2})\) for t>1 and any g, and the variance parameters \(\sigma ^{2}_{\textit {gt}} \sim \mathcal {IG}(a_{t},b_{t})\). We set the variance parameters τ 2 and ω 2 in the normal priors to be 100 to obtain relatively noninformative priors. The shape parameter a t in the inverse gamma prior is set to be 2, so that the prior mean of \(\sigma _{\textit {gt}}^{2}\) equals b t . We further let the scale parameter b t follow a conditionally conjugate gamma prior with \(b_{t} \sim \mathcal {G}(c,d)\), where the hyperparameter c is specified as 1 and the hyperparameter \(d \sim \mathcal {IG}(a_{d},b_{d})\), in which a d and b d are both set to be 0.01 in the simulation study and 1 in the real data analysis. Our hierarchical priors for the variance parameters, which are often difficult to estimate, allow for borrowing the information across genes via \(b_{t} \sim \mathcal {G}(c,d)\) as well as biological conditions via \(d \sim \mathcal {IG}(a_{d},b_{d})\). We intend to specify a noninformative inverse-gamma prior for the parameter d. The value of “1" was specified for both a d and b d in the real data analysis since the real data had a smaller sample size than the simulated data in the simulation study. These values of the hyperparameters still led to noninformative priors since the prior mean and variance of d do not exist. However, these values allowed us to borrow a little but not too much information across different biological conditions under comparison.

Conditional posterior distributions

Let \(\bar {x}_{\textit {gt}}\) denote the average intensities of gene g under condition t and also let the vector \(\bar {\mathbf {X}}_{g}=\{\bar {x}_{g1},\cdots,\bar {x}_{\textit {gT}}\}\) denote the average intensities for gene g. Then, \(\bar {\mathbf {X}}_{g}\) follows a multivariate normal distribution with \(\bar {\mathbf {X}}_{g} \sim \mathcal {N}(\mathbf {A}\mathbf {\Theta }_{g},\mathbf {\Sigma }_{g})\), where Θ g =(μ g ,δ g2,⋯,δ gT )′ is a column vector of size T, and Σ g is a diagonal matrix of size T×T with the t th diagonal element \((\mathbf {\Sigma }_{g})_{t,t}=\sigma ^{2}_{\textit {gt}}/n_{\textit {gt}}\). Here A is a T×T matrix, in which all elements in the first column equals one, i.e., A t1=1 for t=1,⋯,T, all but the first element in the first row equals −1/T, i.e., A 1t =−1/T for t=2,⋯,T, all but the first diagonal element equals 1/T, i.e., A t,t =1/T for t=2,⋯,T, and all other elements equal zero. Since the parameters in Θ g independently follow normal prior distributions, then \(\mathbf {\Theta }_{g} \sim \mathcal {N}(\mathbf {0},\mathbf {\Sigma }_{0})\), where 0 is a vector of size T containing all zero’s and Σ 0 is a diagonal matrix with the first diagonal element (Σ 0)1,1=τ 2 and all other diagonal elements equal to ω 2, i.e., (Σ 0) t,t =ω 2 for t=2,⋯,T. Therefore, the conditional posterior distribution of Θ g is a multivariate normal distribution with Θ g ∼N(U g ,B g ), where the inverse of the variance matrix \(\mathbf {B}_{g}^{-1}=\left (\mathbf {A}'\mathbf {\Sigma }_{g}^{-1} \mathbf {A}+\mathbf {\Sigma }_{0}^{-1}\right)\), and the mean vector \(\mathbf {U}_{g}= \mathbf {B}_{g} \mathbf {A}' \mathbf {\Sigma }_{g}^{-1} \bar {\mathbf {X}}_{g}\). The conditional posterior distribution of the variance parameter \(\sigma _{\textit {gt}}^{2}\) is an inverse-gamma distribution with \(\sigma _{\textit {gt}}^{2} \sim \mathcal {IG}\left (a_{t}+\frac {1}{2}n_{\textit {gt}}, b_{t}+\frac {1}{2}\sum _{k=1}^{n_{\textit {gt}}}\left (x_{\textit {gtk}}-\mu _{\textit {gt}}\right)^{2}\right)\). The conditional posterior density of the hyperparameter b t is given by \(g\left (b_{t}|\sigma _{1t}^{2},\cdots, \sigma _{\textit {Gt}}\right) \propto b_{t}^{(\mathrm {G} a_{t}/2+c)}\exp \left (-\frac {b_{t}}{2}\sum {\sigma _{\textit {gt}}^{-2}}\right)\times \left (\sum _{t'\ne t} b_{t}'+b_{t}+b_{d}\right)^{\mathrm {T} c+a_{d}}\). Consequently, we can apply the Gibbs sampling algorithm to sample the parameters b t , \(\sigma ^{2}_{\textit {gt}}\) and Θ g in turn from their respective conditional posterior distributions using the following steps: (1) sample b t for each condition t from its posterior density function \(g(b_{t}|\sigma _{1t}^{2},\cdots, \sigma _{\textit {Gt}})\) via the Metropolis-Hastings algorithm; (2) sample \(\sigma ^{2}_{\textit {gt}}\) given b t and μ gt for each g and t from its inverse gamma posterior distribution with updated parameters \(a_{t}+\frac {1}{2}n_{\textit {gt}}\) and \(b_{t}+\frac {1}{2}\sum _{k=1}^{n_{\textit {gt}}}(x_{\textit {gtk}}-\mu _{\textit {gt}})^{2}\); (3) sample Θ g given \(\sigma ^{2}_{\textit {gt}}\) for all g from their conditional multivariate normal posterior distribution, and calculate the μ gt based on the sampled values of Θ g .

Model for sequence-based data

Let \(\phantom {\dot {i}\!}\textit {\textbf {Y}}_{\textit {gt}}=(y_{gt1},\cdots y_{gtn_{\textit {gt}}})\) denote all n gt observed counts of the expressed tags of gene g under condition t for g=1,⋯,G and t=1,⋯,T. We assume that y gkt follows a negative binomial distribution, which is commonly used for the count data with overdispersion [2, 26]. Specifically, we assume y gtk follows \(\mathcal {NB}\left (\phi _{t},\frac {m_{\textit {tk}}\lambda _{\textit {gt}}}{\phi _{t}+m_{\textit {tk}}\lambda _{\textit {gt}}}\right)\), with mean m tk λ gt and variance \(m_{\textit {tk}}\lambda _{\textit {gt}}\left (1+m_{\textit {tk}}\lambda _{\textit {gt}}\phi _{t}^{-1}\right)\). We set m tk to be the library size of the k th sample under the t th condition, which is the sum of all counts from this library. The dispersion parameter ϕ t is assumed to be positive, accounting for potential over-dispersion in the data. When the dispersion parameter ϕ t gets extremely large, the value of \(\phi _{t}^{-1}\) approaches to zero, and the negative binomial distribution becomes a Poisson distribution with a mean value of m tk λ gt . DE genes are expected to have different λ gt ’s under different biological conditions.

Hierarchical prior distributions

We assume that each dispersion parameter ϕ t follows a gamma distribution, \(\phi _{t} \sim \mathcal {G}(\alpha _{\phi }, \beta _{\phi })\) independently over t and its scale parameter β ϕ follows an inverse gamma distribution with \(\beta _{\phi } \sim \mathcal {IG}(\zeta _{\phi },\eta _{\phi })\). We also assume that each gene expression parameter λ gt follows an inverse gamma distribution with \(\lambda _{\textit {gt}} \sim \mathcal {IG}(\alpha _{\lambda _{t}},\beta _{\lambda _{t}})\), where the scale parameter \(\beta _{\lambda _{t}} \sim \mathcal {G}(\zeta _{\lambda },\eta _{\lambda })\). In our simulation studies, we set all the hyperparameters \(\{\alpha _{\phi },\zeta _{\phi }, \eta _{\phi },\alpha _{\lambda _{t}},\zeta _{\lambda },\eta _{\lambda }\}\phantom {\dot {i}\!}\) to be one.

Conditional posterior distributions

Since a negative binomial distribution can be written as a Poisson-gamma distribution, we can rewrite the distribution of y gtk as \(y_{\textit {gtk}} \sim \mathcal {P}oi(\theta _{\textit {gtk}})\), and \(\theta _{\textit {gtk}} \sim \mathcal {G}\left (\phi _{t},m_{\textit {tk}} \lambda _{\textit {gt}} \phi _{t}^{-1}\right)\). Then we can derive all the conditional posterior distributions for all of the parameters. Specifically, the conditional posterior distribution of θ gtk is a gamma distribution with \(\theta _{\textit {gtk}} \sim \mathcal {G}\left (y_{\textit {gtk}}+\phi _{t},\left [1+\frac {\phi _{t}}{m_{\textit {tk}}\lambda _{\textit {gt}}}\right ]^{-1}\right)\), the kernel of the conditional posterior density of ϕ t is given by \(\prod _{\textit {gk}} \left [\frac {\phi _{t}^{\phi _{t}}}{\Gamma (\phi _{t})} \left (\frac {\theta _{\textit {gtk}}}{m_{\textit {tk}}\lambda _{\textit {gt}}}\right)^{\phi _{t}} \exp \left (-\frac {\theta _{\textit {gtk}}}{m_{\textit {tk}}\lambda _{\textit {gt}}}\phi _{t}\right) \right ] \exp \left (-\frac {\phi _{t}}{\beta _{\phi }}\right)\phi _{t}^{\alpha _{\phi }-1} I(\phi _{t}>0)\), the conditional posterior distribution of λ gt is \(\mathcal {IG}\left (\sum _{k}\phi _{t}+\alpha _{\lambda _{t}}, \beta _{\lambda _{t}}+\sum _{k}\frac {\theta _{\textit {gtk}}\phi _{t}}{m_{\textit {tk}}}\right)\), and the hyperparameters β ϕ , and \(\beta _{\lambda _{t}}\) respectively have the conditional posterior distributions: \(\beta _{\phi } \sim \mathcal {IG}\left (T\alpha _{\phi }+\zeta _{\phi }, \sum _{t}\phi _{t}+\eta _{\phi }\right)\), and \(\beta _{\lambda _{t}} \sim \mathcal {G}\left (G\alpha _{\lambda }+\zeta _{\lambda },1/\left (1/\eta _{\lambda }+\sum _{g} 1/\lambda _{\textit {gt}}\right)\right)\). Let θ t denote a set containing all θ gtk ’s and λ t as a set containing all λ gt ’s for each condition t. We use the Gibbs sampling algorithm to sample parameters \(\phantom {\dot {i}\!}\{\boldsymbol {\theta }_{t}, \boldsymbol {\lambda }_{t}, \beta _{\lambda _{t}}\}\), ∀t, and β ϕ from their conditional posterior distributions. The conditional posterior distribution of ϕ t does not have a known distribution form. These parameters are sampled using the Metropolis-Hastings sampling algorithm from their conditional posterior distributions.

Confident difference criterion

Preliminary

The confident difference criterion method was extended from the two-criterion method, which was firstly proposed by Ibrahim et al. [13] to detect DE genes for microarray studies with two biological conditions. In this two-criterion method, the fold change between two conditions was defined as \(\xi _{g}=\exp \left (\mu _{g2}+0.5\sigma _{g2}^{2}-\mu _{g1}\right.-\) \(\left.0.5\sigma _{g1}^{2}\right)\), and the posterior probabilities of having at least two fold changes between two conditions, denoted as γ g1=P r(ξ g >2|D obs ) and γ g2=P r(ξ g <1/2|D obs ), were evaluated on each gene to quantify the evidence of its differential expression. A gene is declared to be DE genes if the calculated posterior probabilities γ g1 or γ g2 are sufficiently large. The two-criterion method is easy to compute and provides good false positive and false negative rates [6] for identifying DE genes from microarray studies with two biological conditions. However, the posterior probability γ g defined in this confident difference criterion method does not account for the posterior variability of the fold change, and may not work well for the data with multiple conditions due to the potential multiple comparisons problem since only two conditions can be compared at a time.

In this section, we will develop confident difference criterion using a similar idea of the existing two-criterion method to compare mean expressions (Method I) after taking into account the posterior variability of the mean intensity parameters for the microarray data with two biological conditions. Then we extend the newly developed confident difference criterion method for the microarray data with multiple biological conditions. Furthermore, we will develop another version of the confident difference criterion method to compare both means and variances of the expressions (Method II) for the microarray data. Finally, we extend the confident difference criterion method for comparing mean differential expressions of microarray data (Method I) to the analysis of RNA-Seq data (Method I).

Confident difference criterion for the comparison between mean expressions for the microarray data

Microarray study with two conditions

For a study with two biological conditions, μ g2−μ g1 quantifies the difference in the mean intensities of gene g between the two conditions and its conditional posterior distribution follows a normal distribution. We define the posterior probability as

$$ \gamma_{g}=Pr\left(\frac{|\mu_{g2}-\mu_{g1}|}{\sigma_{\mu_{g2}-\mu_{g1}}}>2 \Bigg| D_{obs}\right), $$
((1))

where \(\sigma _{\mu _{g2}-\mu _{g1}}\) is the posterior standard deviation of μ g2−μ g1. Then we select a cutoff value γ 0 (0<γ 0<1) and declare a gene to be DE if its posterior probability γ g is greater than the cutoff value γ 0.

Note that the choice of γ 0 reflects how strong the evidence is for declaring DE genes. When a larger value is specified for γ 0, fewer genes will be selected to be DE. In the two-criterion method, Chen et al. [6] recommended to use a large cutoff value (ranging between 0.7 and 0.9) because they did not adjust for the posterior variability of the fold change when comparing the gene intensities between the two conditions. After adjusting for the posterior variability, γ g in (1) is quite different than the corresponding posterior probability under the two-criterion method of Chen et al. [6], as shown in the following proposition.

Proposition 1.

Assume that the difference in the mean intensities, μ g2−μ g1, follows a normal distribution. The proposed confident difference criterion method ensures that if γ g ≥γ 0, then the maximum value of the posterior probabilities for the difference μ g2−μ g1 being larger or smaller than zero, i.e., max{P r(μ g2−μ g1>0 |D obs ), P r(μ g2−μ g1<0 |D obs )}, is at least Φ(2−Φ −1(1+Φ(−2)−γ 0)) for γ 0>Φ(−2), where Φ and Φ −1 denote the cumulative distribution function (cdf) and the inverse cdf of the standard N(0,1) distribution, respectively. The detailed proof is presented in Additional file 1.

We note that the maximum value of the posterior probabilities for the difference μ g2−μ g1 being larger or smaller than zero measures a Bayesian p-value. Figure 1 shows a graphical presentation of the Proposition 1 with γ 0 chosen to be 0.5 and 0.7, respectively. For example, we use \(\xi _{\mu _{g2}-\mu _{g1}}\) to denote the posterior mean value of the difference μ g2−μ g1. When γ 0=0.5 and assuming that the posterior mean value \(\xi _{\mu _{g2}-\mu _{g1}}>0\), \(\xi _{\mu _{g2}-\mu _{g1}}\) is at least \(1.94 \sigma _{\mu _{g2}-\mu _{g1}}\) away from zero. The maximum value of the posterior probabilities for the difference μ g2−μ g1 being larger or smaller than zero, max{P r(μ g2−μ g1>0 |D obs ), P r(μ g2−μ g1<0 |D obs )}, will be at least Φ(2−Φ −1(1+Φ(−2)−0.5))=97.4 %. Therefore, we recommend to use a smaller cutoff value than the previous two-criterion method [6] when using (1) for identifying DE genes. Possible choices of the cutoff value γ 0 may range from 0.4 to 0.7.

Fig. 1
figure 1

Graphical illustration of the confident difference criterion method. The figure on the left panel and the right panel uses γ 0=0.5 and γ 0=0.7 separately. The μ g2−μ g1 measures the difference in the mean intensities of gene g between the two conditions. Both figures are drawn based on an assumption that the posterior mean of μ g2−μ g1 are positive. The shaded area in both figures measures the posterior probability for having a positive difference μ g2−μ g1

Connection with the CBF method for microarry study with two conditions

For a microarray study with two biological conditions, we assume that the preprocessed expression intensity from each biological condition follows a normal distribution with \(x_{\textit {gtk}} \sim N\left (\mu _{\textit {gt}},\sigma _{\textit {gt}}^{2}\right)\), and the parameters follow the prior distribution specified in the aforementioned Model for microarray data subsection. For simplicity, we assume that the equal number of intensities are observed from the same gene under different conditions, and they share the same known variance, i.e., n g1=n g2=n g and \(\sigma ^{2}_{g1}=\sigma ^{2}_{g2}={\sigma ^{2}_{g}}\). The proposed confident difference criterion method is used to detect differentially expressed genes. Alternatively, we can also apply the CBF method for the data analysis. To detect differentially expressed genes, we test on the null hypothesis that the mean intensities are equal (μ g1=μ g2) against the alternative hypothesis that the mean intensities are unequal (μ g1≠μ g2) between the two biological conditions. We use the same prior distributions as that in the confident difference criterion method under the alternative hypothesis, and similar prior distributions for the parameters under the null hypothesis. With simple algebra, we can show that the proposed confident difference criterion method for comparing the mean intensities between two biological conditions agrees with the CBF method under the condition stated in the following Proposition.

Proposition 2.

The confident difference criterion method comparing the mean intensities between the two biological conditions with a cut-off value of γ 0 agrees with the CBF method for the hypotheses testing on whether the mean intensities are equal between the two biological conditions with a cut-off value of C 0 if \(\Phi \left (2+E_{g}^{\ast }\right)-\Phi \left (-2+E_{g}^{\ast }\right)=1-\gamma _{0}\), where \(|E_{g}^{\ast }|= \left [ \log \left (\frac {n_{g}\omega ^{2}}{2{\sigma ^{2}_{g}}}+1\right) -2\log (C_{0})\right ]^{\frac {1}{2}}\), provided that the cutoff value C 0 is chosen so that the argument in the square root expression is non-negative. The detailed proof is presented in Additional file 1.

Microarray study with multiple conditions

The confident difference criterion method can be extended to microarray studies with multiple biological conditions. Our primary interest of the study is to identify genes that have differential expressions at least between two biological conditions. Therefore, we define a quadratic form to quantify the differences in the gene expression intensities across different biological conditions, and conduct an overall test to determine whether the mean intensities are different at least under two biological conditions on each gene.

Considering the first biological condition as a reference condition, we define a column vector Δ μ,g =(μ g2−μ g1,⋯,μ gT −μ g1)′ to measure the difference in the mean intensities between each non-reference biological condition and the reference condition. Let \(\phantom {\dot {i}\!}\Sigma _{\boldsymbol {\Delta }_{\mu, g}}\) be the posterior covariance matrix of Δ μ,g . We then propose the quadratic form, \(\boldsymbol {\Delta }_{\mu, g}'\Sigma ^{-1}_{\boldsymbol {\Delta }_{\mu, g}}\boldsymbol {\Delta }_{\mu, g}\), to quantify the differential gene expressions for all non-reference biological conditions compared to the reference condition. Under the null hypothesis that gene g is not DE, the quadratic form follows a chi-square distribution with d f=T−1 when Δ μ,g is assumed to follow a multivariate normal distribution. We note that the multivariate normality holds asymptotically when the sample size is large. We choose an integer, denoted as C, which is closest to the 95 th percentile of the chi-square distribution. For example, for a microarray study with three biological conditions (i.e., T=3), the corresponding C value equals 6. Similar to (1), we compute the posterior probability

$$ \gamma_{g}=Pr\left(\boldsymbol{\Delta}_{\mu, g}'\Sigma^{-1}_{\boldsymbol{\Delta}_{\mu, g}} \boldsymbol{\Delta}_{\mu, g}>C|D_{obs}\right), $$
((2))

and declare gene g to be DE if γ g ≥γ 0.

Confident difference criterion for comparison of both means and variances of expression for microarray data

We note that the confident difference criterion method proposed so far only evaluates the differences in mean intensities. Recall that the Bayes factor approach in Yu et al. [36] is more desirable since it compares both means and variances of the intensities for each gene. Assume that the means and variances are equally important. An appropriate quadratic form can be constructed to quantify the overall difference between both the means and the variances under different conditions on each gene. Since the posterior distribution of \(\sigma _{\textit {gt}}^{2}\)’s is typically skewed, a stabilization transformation of the variance \(\sigma _{\textit {gt}}^{2}\) is required. Let q(.) denote a one-to-one transformation function. The differences in both means and transformed variances of the intensities across different conditions can be summarized in a quadratic form given by

$$ Q_{g}=\boldsymbol{\Delta}_{\boldsymbol{\mu},\boldsymbol{\sigma},g}' \Sigma^{-1}_{\boldsymbol{\Delta}_{\boldsymbol{\mu},\boldsymbol{\sigma}, g}} \boldsymbol{\Delta}_{\boldsymbol{\mu},\boldsymbol{\sigma}, g}, $$
((3))

where \(\boldsymbol {\Delta }_{\boldsymbol {\mu }, \boldsymbol {\sigma },g}=\left (\mu _{g2}-\mu _{g1},\cdots, \mu _{\textit {gT}}-\mu _{g1}, q\left (\sigma ^{2}_{g2}\right)-q \left (\sigma ^{2}_{g1}\right),\cdots, q\left (\sigma ^{2}_{\textit {gT}}\right)-q\left (\sigma ^{2}_{g1}\right)\right)'\) is a column vector of length 2T−2 containing the differences in both means and transformed variances of the intensities. The covariance matrix \(\Sigma _{\boldsymbol {\Delta }_{\boldsymbol {\mu },\boldsymbol {\sigma },g}}\phantom {\dot {i}\!}\) is the posterior covariance matrix of Δ μ,σ,g . Since q(·) is a one-to-one transformation function, then we have \( \sigma ^{2}_{\textit {gt}}=\sigma ^{2}_{gt'}\) if and only if \( q\left (\sigma ^{2}_{\textit {gt}}\right)=q\left (\sigma ^{2}_{gt'}\right)\) for t≠t ′. Thus, the same q(·) function has to be used across all the T treatment groups. The primary reason for introducing the transformation function q(·) is to make the distribution of \(q\left (\sigma ^{2}_{\textit {gt}}\right)\) more normal.

Similar to (2), we compute the posterior probability γ g =P r(Q g >C|D obs ), where C is chosen to be an integer, which is closest to the 95 th percentile of the chi-square distribution with d f=2T−2. For example, C will be chosen to be 9 when T=2, and 13 when T=3. In this paper, we consider the negative cube root transformation on the variance parameters \(\sigma ^{2}_{\textit {gt}}\)’s. The cube root transformation, also known as Wilson-Hilferty transformation, was derived by Wilson and Hilferty [35] to transform a chi-square variate to be approximately normally distributed. In the proposed gene selection algorithm below, the cutoff value γ 0 will be automatically determined to control the false discovery rate to be less than a targeted level.

Confident difference criterion for sequence-based data

As discussed in the Model for sequence-based data subsection, the parameter λ gt quantifies the expression level of gene g under condition t. The differences in λ gt ’s measure the relative differential expressions of gene g between the conditions. Note that the λ gt ’s likely have small values and their posterior distributions may be skewed. Therefore, we apply a log transformation on λ gt ’s and use the differences in logλ gt ’s to quantify the differential gene expressions among different biological conditions. Similar to the confident difference criterion method for microarray data, we propose the confident difference criterion method for the sequence-based data with two biological conditions as follows. We first compute

$$ \gamma_{g}=Pr\left(\frac{|\log(\lambda_{g2})-\log(\lambda_{g1})|} {\sigma_{\log(\lambda_{g2})-\log(\lambda_{g1})}}>2|D_{obs}\right), $$
((4))

where \(\sigma _{\log (\lambda _{g2})-\log (\lambda _{g1})}\) is the posterior standard deviation for the difference log(λ g2)− log(λ g1). We then declare gene g to be DE if γ g ≥γ 0, where 0<γ 0<1 is a predetermined credible level.

When the sequence-based data are collected from multiple conditions, the first biological condition will be considered as the reference condition and a column vector Δ λ,g =(log(λ g2)− log(λ g1), log(λ g3)− log(λ g1),⋯, log(λ gT )− log(λ g1))′ contains the differences in the log scaled expression values between the non-reference conditions and the reference condition. Let \({\Sigma }_{\boldsymbol {\Delta }_{\lambda,g}}\phantom {\dot {i}\!}\) denote the posterior covariance matrix of Δ λ,g . Accordingly, the confident difference criterion method is defined as \(\gamma _{g}=Pr\left ({\boldsymbol {\Delta }}_{\lambda,g}'\mathbf {\Sigma }^{-1}_{{\boldsymbol {\Delta }}_{\lambda,g}} {\boldsymbol {\Delta }}_{\lambda,g}>C_{\lambda }|D_{\textit {obs}}\right)\), where C λ is an integer, which is closest to the 95 th percentile of the chi-square distribution with d f=T−1. Again, we declare gene g to be DE if γ g ≥γ 0, where 0<γ 0<1.

From the Model for sequence-based data subsection, we see that the variance of the observed count y gtk is \(m_{\textit {tk}}\lambda _{\textit {gt}} \left (1+ m_{\textit {tk}} \lambda _{\textit {gt}} \phi ^{-1}_{t}\right)\), which is a function of λ gt and ϕ t , for the sequence data. Since ϕ t does not depend on g, it is sufficient to compare the mean expressions under different conditions for determining DE genes for the sequence data.

False discovery rate and gene selection algorithm

The proposed confident difference criterion methods calculate the value of γ g on each gene, whose magnitude reflects the evidence of differential expression. When γ g is large enough, the gene will be declared to be DE. It is of great importance to determine how to choose the cutoff value γ 0.

We adopt the approach proposed by Tadesse et al. [31] to select the cutoff value γ 0 for controlling the Bayesian FDR. Let V denote the number of incorrect decisions by identifying EE genes as DE genes and let R be the number of identified DE genes. Then the positive false discovery rate defined by Storey [30] is given by \(pFDR=E\left (\frac {V}{R}|R>0\right)\).

We need to test the hypotheses of H 0g : gene g is EE versus H 1g : gene g is DE on each gene. We assume that all genes have the same probability of being EE, and DE, respectively, i.e., P r(H 0g )’s are equal for all genes, and P r(H 1g )’s are equal for all genes. Therefore, the γ g ’s are independently and identically distributed. Following Tadesse et al. [31], the Bayesian FDR b F D R(γ 0) when using a cutoff value of γ 0 for the confident difference criterion method is defined as

$$ bFDR(\gamma_{0})=\frac{1}{Pr(R>0)}\times \frac{Pr\left(\gamma_{g} \ge \gamma_{0}|H_{0g}\right)Pr(H_{0g})}{Pr(\gamma_{g} \ge \gamma_{0})}, $$
((5))

and P r(γ g ≥γ 0)=P r(γ g ≥γ 0|H 0g )P r(H 0g )+P r(γ g ≥γ 0|H 1g )P r(H 1g ). To estimate the FDR, we need to compute P r(γ g ≥γ 0|H 0g ), P r(γ g ≥γ 0|H 1g ) and P r(H 1g ). Note that gene g can be classified into DE or EE depending on whether γ g ≥γ 0. We reuse the data information and specify the prior probability P r(H 1g ) as the proportion of genes classified as DE. Denote the total number of identified DE genes as n D . Then the probability of a gene being DE will be P r(H 1g )=n D /G. Additionally, we estimate the true parameters in the gene expression data distributions from DE or EE genes as the posterior means of the corresponding parameters from the identified DE and EE genes, respectively. An algorithm using the posterior samples from DE or EE genes to estimate the aforementioned probabilities P r(γ g ≥γ 0|H 0g ), P r(γ g ≥γ 0|H 1g ) and b F D R(γ 0) is given as follows.

  • Split the genes into two subsets containing n E EE genes (EEGENE) with calculated γ g <γ 0 and n D DE genes (DEGENE) with γ g ≥γ 0.

  • Note that a DE gene can be either up or down regulated under some condition compared to the reference condition in terms of means or variances of the expression values for microarray experiment or in terms of mean gene expressions for sequence-based experiment. Accordingly, the DE genes will be further split into a series of gene subsets based on the pattern of parameters in comparisons under different biological conditions. For example, in a microarray study with three biological conditions and the mean gene expressions are in comparison. Consider the first condition as the reference condition. The DE genes can be classified into four subsets: (i) genes with lower mean gene expressions under both conditions 2 and 3; (ii) genes with lower mean gene expressions under condition 2 but higher mean gene expressions under condition 3; (iii) genes with higher mean gene expressions under condition 2 but lower mean gene expressions under condition 3; and (iv) genes with higher mean gene expressions under both conditions 2 and 3. We denote these subsets of DE genes as D â„“ ,â„“=1,⋯,L, where the number of subsets L=2T−1 for the microarray study when only mean parameters are in comparison or the sequenced-based study; and L=4T−1 for the microarray study when both mean and variance parameters are in comparison. We also denote the number of genes in the D â„“ DE gene subsets as n D â„“ .

  • For EE genes identified in previous steps, the same data distributions will be considered for the gene expression data from different biological conditions. Hierarchical priors similar to those proposed previously in the Model for microarray data section and the Model for sequence-based data subsection will be augmented separately for microarry data or sequence-based data. The posterior mean of each parameter defined in the distribution of the gene expression data will be calculated. The true parameters in the gene expression data distribution will be estimated using the average value of posterior mean of corresponding parameters from all EE genes. For all identified DE genes, the Markov chain Monte Carlo (MCMC) sampling values from previous steps when implementing the proposed confident difference criterion method will be used for calculating the posterior means of the parameters defined in the gene expression data distribution. For each differential gene expression pattern â„“, the actual value of each parameter in the gene expression data distribution will be estimated using the average value of its posterior means across all genes in the subset D â„“ with this DE pattern.

  • Using the estimated values for the parameters in the gene expression data, the data will be simulated for κ×G genes (say κ=0.1), among which κ n E EE genes and \(\kappa n_{D_{\ell }}, \ell =1,\dots,L\) DE genes with a pattern of differential gene expression observed in the DE gene subset D â„“ , respectively.

  • The posterior probability γ g will be calculated for each gene based on the simulated data. Depending on whether γ g ≥γ 0, the gene in the simulated data will also be claimed to be DE or EE.

  • Denote the total number of identified DE and EE genes from the simulated data as m D and m E . Then the probability for a EE genes claimed to be DE, P r(γ g ≥γ 0|H 0g ), will be estimated as \(Pr(\gamma _{g} \ge \gamma _{0}|H_{0g})=\frac {m_{E}}{\kappa n_{E}}\); and the probability for a DE genes claimed to be DE, P r(γ g ≥γ 0|H 1g ), will be estimated as \(Pr(\gamma _{g} \ge \gamma _{0}|H_{1g})=\frac {m_{D}}{\kappa n_{D}}\). Using (5), the estimated Bayesian FDR equals \(\widehat {bFDR}=\frac {m_{E}}{m_{D}+m_{E}}\).

Note that steps (4) to (6) provide a predictive approach to estimate bFDR when a certain value γ 0 was used for identifying DE genes. Therefore, we can control the FDR at some pre-specified value (i.e. 0.05) by choosing the corresponding cutoff value \(\hat {\gamma _{0}}\) as the minimum value of all cutoff values with an associated FDR no more than 0.05, or \(\hat {\gamma }_{0}=\min \{\gamma _{0}: (\widehat {bFDR}(\gamma) \le 0.05)\}\).

Results and discussion

In this section, two different simulation studies were conducted to investigate the performance of the proposed confident difference criterion methods on identifying DE genes for microarray or sequence-based studies, respectively. In addition, a real affymetrix dataset is used to further demonstrate the proposed methodology.

Simulation study I: Microarray data

Two settings were considered. In the first setting, the intensity values having different means and variances between two biological conditions on each DE gene are simulated. In the second setting, the data were simulated from three biological conditions, with DE genes having different mean and variance values between at least two biological conditions.

Setting 1 (Two conditions)

Fifty simulations were used in this study to investigate the performance of different versions of the confident difference criterion methods described in the Confident difference criterion section. In each simulation, there were 5000 genes in total and 500 DE genes with 10 replications under each of the two biological groups. The log-scaled data were generated via \(x_{g1k} \overset {\text {iid}}{\sim } \mathcal {N}(\mu _{g}-0.5 \delta _{g}, 0.2^{2}), x_{g2k} \overset {\text {iid}}{\sim } \mathcal {N}\left (\mu _{g}+0.5\delta _{g}, 0.9^{2}\right)\) with δ g =1,∀g=1,…,250 and δ g =−1,∀g=251,…,500 for the DE genes, and \(x_{g1k},x_{g2k} \overset {\text {iid}}{\sim } \mathcal {N}(\mu _{g}, 0.7^{2})\) for the remaining EE genes. The average intensities μ g were generated from an uniform distribution, where \(\mu _{g} \overset {\text {iid}}{\sim } \mathcal {U} (5,11)\) for all genes. Conditionally conjugate priors described in the Model for microarray data subsection were used for all parameters μ g , δ g , \(\sigma _{g1}^{2}\), \(\sigma _{g2}^{2}\) and \({\sigma _{g}^{2}}\).

The simulated data were analyzed using both Methods I and II of the confident difference criteroin methods. For each version, the cutoff value γ 0 were set to be 0.4, 0.6, or the cutoff value controlling the FDR to be no more than 0.05, separately. The genes with the calculated posterior distribution values γ g via Equation (1) or Equation (3) less than the chosen γ 0 were identified to be DE. To evaluate the performance of the confident difference criterion methods, the simulated datasets were also analyzed by four existing methods: Significant Analysis of Microarrays (SAM) [33], Linear Models for Microarray Data (LIMMA) [28], Semiparametric Hierarchical Method (SPH) [23], Empirical Bayesian Analysis of Gene Expression Data (EBarrays or EBA) [14]. All these existing methods allowed a control of the FDR for multiple comparisons. The genes were declared to be DE with FDR controlled at 0.05 for all these four methods.

Based on the identified gene list by each method, we calculated the number of claimed DE genes (CDE), the number of correctly claimed DE genes (CCDE), the number of correctly claimed EE genes (CCEE), the false positive rate (FPR), false negative rate (FNR), false discovery rate (FDR) and false non-discovery rate (FNDR) for all considered methods. These results and their standard deviations reported in parentheses were summarized in Table 1. Note, for Methods I and II, the choice of γ 0=0.4 identified the a larger number of DE genes when compared to the choice of γ 0=0.6. While for the case with FDR control of 0.05, the Method II identified the largest number of DE genes among all six methods. We also compared the results of the confident difference criterion method with a control of FDR against all four existing methods. We expected a method with good performance will provide a good control of FDR and provide smaller error rates in terms of FPR, FNR and FNDR. Under both versions of the confident difference criterion method, the achieved FDR is close to but less than 0.05, implying that the proposed confident difference criterion methods provided a good control of the FDR. All four existing methods also obtained a control of FDR at 0.05 successfully, although the SPH method provides a conservative control of FDR with the empirical FDR equal to 0.02. Since all methods provided small error rates of FPR and FNDR, we put more weight to the comparison of the empirical FNRs among all applied methods. The results in Table 1 showed that Method II provided the smallest empirical FNR out of all methods by successfully identifying almost all truly DE genes; and Method I had comparable empirical FNR as the SAM and the LIMMA methods, and much smaller empirical FNR when compared to the SPH and the EBA methods.

Table 1 Performance evaluation under Study I (Setting 1), (G=5000, 500 DE gene) #

Setting 2 (Three conditions)

The data were simulated from three biological conditions, and the first biological condition was considered as the reference group. A gene was set to be DE so that at least one group would be either up or down regulated from the reference group. Specifically, 500 DE genes out of 5000 genes were set in the data, and the log intensities of the DE genes were generated via \(x_{g1k} \overset {\text {iid}}{\sim } \mathcal {N}\left (\mu _{g1}, 0.2^{2}\right), x_{g2k} \overset {\text {iid}}{\sim } \mathcal {N}\left (\mu _{g1}+0.5\nu _{g1}, 0.5^{2}\right), x_{g3k} \overset {\text {iid}}{\sim } \mathcal {N}\left (\mu _{g1}+0.5 \nu _{g2}, 0.8^{2}\right)\). Depending on whether the gene was set to be DE in one or both conditions from reference group, the parameters ν g1 and ν g2 were set to have ν g1=ν g2=1.5 for g=1,⋯,62 (up-regulated in both conditions); ν g1=1.5, ν g2=0 for g=63,⋯,125 (only up-regulated in condition 2); ν g1=1.5, ν g2=−1.5 for g=126,⋯,187 (up-regulated in condition 2, down-regulated in condition 3); ν g1=0 and ν g2=1.5 for g=188,⋯,250 (only up-regulated in condition 3); ν g1=0, ν g2=−1.5 for g=251,⋯,312 (only down-regulated in condition 3); ν g1=−1.5, ν g2=1.5 for g=313,⋯, 375 (down-regulated in condition 2, up-regulated in condition 3); ν g1=−1.5, ν g2=0 for g=376, ⋯,437 (only down-regulated in condition 2); ν g1=ν g2=−1.5, for g=438,⋯,500 (down-regulated in both conditions). The remaining genes were EE and their log intensities were generated via \(x_{\textit {gtk}} \overset {\text {iid}}{\sim } \mathcal {N}\left (\mu _{g}, 0.6^{2}\right)\) for t=1,2,3 and g=501,⋯,5000. On all genes, the parameter μ g1 were generated from an uniform distribution, i.e., \(\mu _{g1} \overset {\text {iid}}{\sim } \mathcal {U}(5,11)\). Each condition contained 10 replicates on each gene and 50 simulations were conducted.

The model similar to those described in Model for microarray data subsection and the proposed confident difference criterion methods including the Method I for comparing mean expressions and the Method II comparing both mean and variance expressions were applied to the simulated data. We considered three choices for the cutoff value γ 0, including prespecified values 0.4, 0.6, or a value with FDR controlled at 0.05, separately. The data were also analyzed by the SAM [33], LIMMA [28], and EBArrays [14] with the FDR controlled at 0.05. The SPH [23] were not used in the study as they were proposed for studies with two biological conditions only. The analytical results from all methods were compared based on four error rates including FPR, FNR, FDR and FNDR from each considered method (Table 2). The confident difference criterion methods including Method I and Method II as well as the existing methods except LIMMA all provided an empirical FDR no more than 0.05 successfully. Comparing to the existing methods, the proposed confident difference criterion methods provided comparable FPR and smaller FNR and FNDR. Method II of the confident difference criterion method compares both mean and variance values of the gene expression intensities across different biological conditions. This is a potential reason for the proposed method providing smaller FNR for microarray data analysis. The confident difference criterion method is particularly effective when both mean and variance of the expression intensities differ across biological conditions on the DE genes.

Table 2 Performance evaluation under Study I (Setting 2), (G=5000, 500 DE gene) #

Simulation Study II: sequence-based data

The focus of this study is to investigate the performance of the proposed confident difference criterion method for identifying DE genes from sequence-based high-throughput experiments including SAGE and RNA-Seq studies.

Setting 1 (SAGE experiment)

The simulation proposed by Lu et al. [20] was used to conduct the simulation study. Specifically, 5000 genes were sampled under five libraries from each of the two conditions with fixed library sizes sampled uniformly between 30000 and 90000. A total of 500 genes were set to be DE genes. The data were generated from a negative binomial distribution, \(y_{\textit {gtk}} \overset {\text {iid}}{\sim } \mathcal {NB}(\phi _{t}, \frac {m_{\textit {tk}}\lambda _{\textit {gt}}}{\phi _{t}+m_{\textit {tk}}\lambda _{\textit {gt}}})\) for gene g, for a fixed library k of condition t, where m tk was the library size for library k under condition t; ϕ 1 and ϕ 2 denoted the dispersion parameters for data from the two conditions separately, and both set to be 0.4; λ gt measured the expression level of gene g under condition t and were set with different values when gene g is DE and the same value when gene g is EE. For g=1,⋯,250, we set λ g1=8E−4 and λ g2=2E−4 to include down-regulated genes in condition 2. For g=251,⋯,500, we set λ g1=2E−4 and λ g2=8E−4 to include up-regulated genes in condition 2. For other genes with g=501,⋯,5000, we set λ g1=λ g2=2E−4 to include EE genes. Fifty simulations were used in this study.

The proposed confident difference criterion method for RNA-Seq data was used to analyze the simulated data. The posterior probability γ g measuring the evidence of differential gene expression were estimated using average value of its posterior sampled values. The cutoff value γ 0 for γ g were set to be 0.4, 0.6 or a value to control the FDR to be 0.05, separately. The genes with estimated γ g less than the chosen γ 0 value were claimed to be DE. We also fit several other existing methods including edgeR [26], DESeq [2], BaySeq [10], NBPSeq [8], EBSeq [17], NOISeq [32], SAMSeq [19], and TSPM [3]. When the edgeR method was applied, we chose both options to estimate the common dispersion parameter for all tags and the tag-wise dispersion parameters respectively. For the NOISeq method, we estimated and controlled the FDR using the method proposed by Newton et al. [23] for identifying DE genes.

The results using the proposed confident difference criterion methods and all fitted existing methods for RNA-Seq data were summarized in Table 3. Similar to the simulation study I for microarray data, Table 3 showed that the higher the cutoff value γ 0, the less number of genes were identified to be DE. The confident difference criterion method with a control of FDR at 0.05 achieved an empirical FDR of 0.044, and successfully identified 328.8 genes (65.8 %) on average out of 500 truly DE genes. Compared to other considered methods, the confident difference criterion method performed the best by providing the smallest FNR and FNDR while maintaining comparable FPR and a well controlled FDR. Out of the applied existing methods, the NOISeq method and edgeR method achieved the lowest FNR, and a FDR of no more than 0.05. The BaySeq method provided a conservative control of FDR, and achieved an empirical FDR of lower than 0.001 when controlling the FDR at 0.05. The DESeq, EBSeq and TSPM methods failed to control the FDR at 0.05. The SAMSeq method and TSPM method failed to identify most of the truly DE genes as DE genes, which was not surprising as the performance of both the SAMSeq and TSPM methods is highly sample size dependent as pointed out by Soneson and Delorenzi (2013) [29].

Table 3 Performance evaluation under Study II (Setting 1), (G=5000, 500 DE gene) #

Setting 2 (RNA-Seq experiment)

We used a similar simulation setting proposed by Kvam et al. [16] for illustrating the application of the proposed confident difference criterion method for RNA-Seq experiment. We still simulated 50 dataset, each dataset contained six libraries with three libraries from each of the two conditions on 5000 genes, among which 250 genes were set to be up-regulated genes and another 250 genes were set to be down-regulated genes in condition 2 versus condition 1. The overall mean expression levels across both conditions were generated from a gamma distribution with \(\lambda _{g} \sim \mathcal {G}(0.25, 600)\). To avoid including genes with low expression levels from both conditions as DE genes, we set the difference in the gene expression levels between conditions in two ways depending on whether the value of λ g is larger than one. Specifically, we generated ξ g from uniform distribution \(\mathcal {U}(3,20)\) for each gene. If the value of λ g >1, we let the fold change between the gene expression values of DE genes to be ξ g , or \(\lambda _{g1}=\lambda _{g}/\sqrt {\xi _{g}}\) and \(\lambda _{g2}=\lambda _{g}*\sqrt {\xi _{g}}\) for up-regulated genes, and \(\lambda _{g1}=\lambda _{g}*\sqrt {\xi _{g}}\) and \(\lambda _{g2}=\lambda _{g}/\sqrt {\xi _{g}}\) for down-regulated genes. If the value of λ g ≤1, we let the absolute difference in the gene expression values to be ξ g , or we let λ g1=λ g +ξ g and λ g2=λ g for down-regulated genes, and λ g1=λ g and λ g2=λ g +ξ g for up-regulated genes in condition 2. For an EE gene, we had λ g1=λ g2=λ g .

Then we generated the data using negative binomial distribution of \(y_{\textit {gtk}} \overset {\text {iid}}{\sim } \mathcal {NB}(\phi _{t}, \frac {\lambda _{\textit {gt}}}{\phi _{t}+\lambda _{\textit {gt}}})\) for gene g, and the overdispersion parameters Ï• 1 and Ï• 2 were set to have Ï• 1 = 1 and Ï• 2 = 8 respectively for DE genes; and Ï• 1=Ï• 2=4 for EE genes.

All methods applied in setting I of simulation study II were also used for data analysis in this simulation study. The results in Table 4 displayed that the confident difference criterion method with a control of FDR at 0.05, the edgeR method with common dispersion parameter over genes, the edgeR with gene-wise dispersion parameter, the BaySeq, the NBPSeq, the NOISeq methods successfully controlled the FDR at 0.05. Additionally the confident difference criterion method, the NBPSeq method, the edgeR method with a common dispersion parameter over genes also provided a good and comparable control of FNR of less than 0.2, while maintaining low levels of FPR and FNDR.

Table 4 Performance evaluation under Study II (Setting II), (G=5000, 500 DE gene) #

Real data analysis

We used a real data set obtained using customized Bovine Affymetrix arrays (Davis, Talbott, Yu, and Cupp, unpublished results) to illustrate the proposed method. Fifteen arrays composed of three replicate arrays under three biological conditions were produced to screen for DE genes associated with prostaglandin F2α(PGF) treatment after 30 min, 1 h, 2 h, and 4 h compared to the control treatment (saline). For simplicity, we focused on detecting genes using the confident difference criterion methods (Method I and Method II) that were regulated 1 h or 2 h after PGF treatment. The data were extracted, normalized and summarized using the Robust Multi-array Average (RMA) [12] method at the exon level via the Affymetrix expression console. The data set contains 21724 genes. Note that some genes may have multiple probe replicates ranging from one replicate to 266 replicates, and the data from different probes of the same gene may have large variation even after RMA normalization. We centered the data from each probe of the same gene to the mean log intensities of that gene, and excluded 3116 genes with only a single probe replicate from the analysis to make sure that the parameters were estimable. Additionally, we excluded 2137 low expression genes if two-thirds or more (six out of nine) samples on this gene had gene expression values measured by the geometric mean expression values across different probes less than 10. Of the remaining 16471 genes with replicate probes, we used z gjtk to denote the k th biological replicate sample of the log2 scale gene expression intensity for probe j of gene g under condition t. Note that the index j was added to the previous notations for the log intensity values as data are available for multiple probes on the same gene. We assumed normal distribution for the log2 intensities with \(z_{\textit {gjtk}} \sim \mathcal {N}(\mu _{\textit {gtk}}, \sigma ^{2\ast }_{\textit {gt}})\), and the same prior for μ gtk as what we set for X gt in the Model for microarray data subsection. The variance parameters are assumed to follow inverse gamma distribution with \(\sigma ^{2\ast }_{\textit {gt}} \sim \mathcal {IG}(\alpha ^{\ast }_{t},\beta ^{\ast }_{t})\) with \(\alpha ^{\ast }_{t}=2\) and \(\beta ^{\ast }_{t} \sim \mathcal {G}(\alpha ^{\ast }_{0},\beta ^{\ast }_{0})\). We set \(\alpha ^{\ast }_{0}=1\) and \(\beta ^{\ast }_{0}\sim IG(\alpha ^{\ast },\beta ^{\ast })\) where both α ∗=β ∗=1. During computation for controlling the FDR, we reuse these settings of the prior distributions on the parameters μ gtk and \(\sigma ^{2\ast }_{\textit {gt}}\) for DE genes. For EE gens, we assume that \(z_{\textit {gjtk}} \sim \mathcal {N}(\mu _{\textit {gtk}},\sigma ^{2\ast }_{g})\), and make similar augment for the prior distributions of their parameters μ gtk and \(\sigma ^{2\ast }_{g}\) as the DE genes. The proposed confident difference criterion methods were applied to assess the evidence of differential expression on each gene and identify DE genes with the cutoff value equal to be 0.4, 0.6 or a value that controls the FDR at 0.05.

In addition, we analyzed the real data using the existing methods including SAM, LIMMA, and EBarrays as described in the Simulation Studies section for identification of DE genes. Since the existing methods were developed for data with single probe replicate on each gene, we calculated the mean log intensities over all probes for each biological sample on each gene to quantify the corresponding gene expression. The genes were declared to be DE if the false discovery rate was no more than 0.05. We used Venn diagrams to demonstrate the overlap of DE genes identified by Method I (Fig. 2, Left Panel) or Method II (Fig. 2, Right Panel), to the DE genes identified by SAM and EBarrays (Fig. 2). The results showed that more genes were identified to be DE by the proposed Method I and Method II than the existing methods. Specifically, 1050 DE genes were identified by Method II, while 896 genes were identified to be DE by either SAM or EBarrays. Of note 340 out of 353 DE genes identified by LIMMA were also identified by SAM (data not shown), and 951 of 991 DE genes identified by Method I were also identified as DE by Method II. We found that SAM identified 375 DE genes, all of which were also identified by other methods. For example, 358 (95.5 %) genes identified by SAM were also identified by Method I or II; and 342 (91.2 %) genes identified by SAM were also identified by EBarrays method. The EBarrays method identified 863 DE genes, out of which, 643 (74.5 %) genes were also identified by Method I or II. Method I identified 116 of the 324 genes identified by LIMMA when comparing all four time points versus control in the same dataset, while Method II called 105 out of 387 genes DE that were also called DE by LIMMA within the whole dataset. In addition, many genes identified to be DE only by Method II not by Method I show a linear trend among the average gene expression across conditions observed from samples collected with longer time after treatment, and larger data variations under the control condition than those observed at other time points after treatment. For example, the average log2 gene expression of THBS1 increased from 9.22 under control condition to 10.35 at 2h after treatment, and the standard deviation equaled 0.88 under the control condition, and 0.37 at 2 h after treatment. This gene was only detected to be DE by Method II and was shown to play roles in angiogenesis [37].

Fig. 2
figure 2

Number of identified DE genes out of 16471 genes from real data analysis. Two venndiagrams present the overlapping among the DE genes identified separately by Method I/II, SAM, and EBarrays with the false discovery rate controlled at 0.05 from the real data

The genes identified solely by Method I or Method II were analyzed by Ingenuity Pathway Analysis (IPA, Build version: 313398M, Content version: 18841524 (Release Date: 2014-06-24) to determine biological functions and pathways associated with the newly identified genes. Genes identified solely by Method I and not by SAM or EBarrays were analyzed by IPA which identified several major canonical pathways such as hepatic fibrosis / hepatic stellate cell activation, glucocortiocoid receptor signaling, agranulocyte adhesion and diapedesis, and role of IL-17A in arthritis (Additional file 2: Table S1). Many of the canonical pathways identified have either established or potential roles in corpus luteum function indicating that Method I identified DE genes that are biologically relevant within the model. Method I also identified IL1B (P=2.12E−08) and TNF (P=3.03E−08) as upstream regulators of the genes found exclusively by Method I, which also fits with known and suspected mechanisms of PGF action within the corpus luteum [1, 24].

Genes identified solely by Method II were also analyzed by IPA which identified canonical pathways such as hepatic fibrosis/hepatic stellate cell activation [21], glucocorticoid receptor signaling, IL-8 signaling, and granulocyte adhesion and diapedesis. Upstream regulators of gene found solely by Method II included: IL1B (P=4.56E−13), TGFB1 (P=1.19E−11), and IFNG (P=1.82E−11). The IPA results both concur with current literature and offer new insights into the possible mechanism(s) of action of PGF in the corpus luteum [1, 9, 11, 21]. These and similar canonical and regulatory functions were also identified when the complete dataset (30 min, 1 h, 2 h, and 4 h) was analyzed by IPA. These network functions are in agreement with the known or suspected changes in biological function in the corpus luteum following PGF treatment in several species [1, 5, 22, 27]. Several of the genes identified by Methods I and II are known to be involved in regulation of the fate of the corpus luteum after PGF treatment, and were also identified as DE genes in our larger data set and a similar microarray dataset examining the effects of PGF in the cow [22]. For example, genes that code for chemokines (e.g., CCL3 and CCL8) and prostaglandin synthesis (e.g., PTGS2) were found to be significantly up-regulated at 1 and 2 h using Methods I and II which were not identified using LIMMA. However, CCL3, CCL8, and PTGS2 were all identified as significantly up-regulated in later time points using LIMMA, which conservatively identifies DE genes. Therefore, it seems possible that Methods I and II may provide a more sensitive approach to identify the temporal patterns of gene expression.

Conclusion

In this paper, we have proposed a new differentially expressed gene selection algorithm, which controls the FDR based on predictive Bayesian estimates. The simulation studies empirically showed that the proposed confident difference criterion methods outperform the existing methods when comparing gene expressions across different conditions for both microarray studies and sequence-based high-throughput studies. For the analysis of the real data, the method II successfully identified more clinically important DE genes than the other methods. In comparison to Method I, the Method II provides a much better sensitivity rate, but slightly a lower specificity rate based on the simulation studies.

In scenarios where the data are not symmetrically distributed, we need to model the data with other types of distributions (e.g., a gamma distribution). The confident difference criterion method proposed for comparing both means and variances can also be extended to evaluate the differences in multiple parameters defined in the non-normal data distributions. In addition, the Euclidean distances used in the proposed confident difference criterion method may also be extended to other types of distances to measure the difference among the distributions under two or more biological conditions. In the case of two conditions, the entropy-based distance such as the Kullback-Leibler (KL) divergence may be considered. However, the distribution of the entropy-based statistics is quite difficult to characterize and, hence, it is quite challenging to choose the cutoff value for the entropy statistics. Such extensions need to be further investigated. Finally, we note that all models considered in this paper assume that the gene expressions are independent across genes. The proposed confident difference criterion methods do not require the independence assumption. However, the performance of the confident difference criterion methods under the correlated models need to be further examined.

Availability and requirements

All analyses results presented in this paper were obtained using codes developed in FORTRAN with IMSL library. We have also implemented the proposed method in R for windows (32 bits). The R codes can be obtained at the websites: http://www.unmc.edu/publichealth/departments/biostatistics/facultyandstaff/cdc_micro.zipand http://www.unmc.edu/publichealth/departments/biostatistics/facultyandstaff/cdc_RNASeq.zip.

References

  1. Atli MO, Bender RW, Mehta V, Bastos MR, Luo W, Vezina CM, et al. Patterns of gene expression in the bovine corpus luteum following repeated intrauterine infusions of low doses of prostaglandin F2 α. Biol Reprod. 2012; 86(4):130.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010; 11:R106.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Auer PL, Doerge RW. A two-stage poisson model for testing RNA-Seq data. Stat Appl Genet Mol Biol. 2011; 10:1–26.

    Google Scholar 

  4. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008; 456(7218):53–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Bishop CV, Bogan RL, Hennebold JD, Stouffer RL. Analysis of microarray data from the macaque corpus luteum; the search for common themes in primate luteal regression. Mol Hum Reprod. 2011; 17(3):143–51.

    Article  CAS  PubMed  Google Scholar 

  6. Chen M-H, Ibrahim JG, Chi Y-Y. A new class of mixture models for differential gene expression in DNA microarray data. J Stat Plan Inference. 2008; 138:387–404.

    Article  PubMed  PubMed Central  Google Scholar 

  7. Dudroit S, Yang YH, Callow MJ, Speed TP. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica. 2002; 12:111–39.

    Google Scholar 

  8. Di Y, Schafer DW, Cumbie JS, Chang JH. The NBP negative binomial model for assessing differential gene expression from RNA-Seq. Stat Appl Genet Mol Biol. 2011; 10(1):1–28.

    Google Scholar 

  9. Galväo AM, Ferreira-Dias G, Skarzynski DJ. Cytokines and angiogenesis in the corpus luteum. Mediators Inflamm. 2013; 2013:420186.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Hardcastle TJ, baySeq KellyKA. Empirical Bayesian analysis of patterns of differential expression in count data. BMC Bioinformatics. 2010; 11:422–35.

    Article  PubMed  PubMed Central  Google Scholar 

  11. Hou X, Arvisais EW, Jiang C, Chen DB, Roy SK, Pate JL, et al. Prostaglandin F2 α stimulates the expression and secretion of transforming growth factor B1 via induction of the early growth response 1 gene (EGR1) in the bovine corpus luteum. Mol Endocrinol. 2008; 22(2):403–414.

    Article  CAS  PubMed  Google Scholar 

  12. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003; 4(2):249–64.

    Article  PubMed  Google Scholar 

  13. Ibrahim JG, Chen M-H, Gray RJ. Bayesian models for gene expression with DNA microarray data. J Am Stat Assoc. 2002; 97:88–99.

    Article  Google Scholar 

  14. Kendziorski CM, Newton MA, Lan H, Gould MN. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Stat Med. 2003; 22:3899–914.

    Article  CAS  PubMed  Google Scholar 

  15. Kuo L, Yu F, Zhao Y. Statistical methods for identifying differentially expressed genes in replicated experiments: A review. In: Biswas A, Data S, Fine J, Segal M, editors. Statistical Advances in the Biomedical Sciences: Clinical Trials, Epidemiology, Survival Analysis, and Bioinformatics. Hoboken, NJ: Wiley-Interscience: 2008. p. 341–64.

    Google Scholar 

  16. Kvam VM, Liu P, Si Y. A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. Am J Bot. 2012; 99(2):248–56.

    Article  PubMed  Google Scholar 

  17. Leng N, Dawson JA, Stewart RM, Ruotti V, Rissman A, Smits B, et al. EBseq: An empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics. 2013; 29(8):1035–43.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature. 2008; 456(7218):66–72.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Li J, Tibshirani R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-seq data. Stat Methods Med Res. 2013; 22:519–36.

    Article  PubMed  Google Scholar 

  20. Lu J, Tomfohr JK, Kepler TB. Identifying differential expression in multiple SAGE libraries: an overdispersed log-linear model approach. BMC Bioinformatics. 2005; 6:165.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Maroni D, Davis JS. TGFB1 disrupts the angiogenic potential of microvascular endothelial cells of the corpus luteum. J Cell Sci. 2012; 124(14):2501–510.

    Article  CAS  Google Scholar 

  22. Mondal M, Schilling B, Folger J, Steibel JP, Buchnick H, Zalman Y, et al. Deciphering the luteal transcriptome: potential mechanisms mediating stage-specific luteolytic response of the corpus luteum to prostaglandin F 2 α. Physiol Genomics. 2011; 43(8):447–56.

    Article  CAS  PubMed  Google Scholar 

  23. Newton MA, Noueiry A, Sarkar D, Ahlquist P. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004; 5:155–76.

    Article  PubMed  Google Scholar 

  24. Okuda K, Sakumoto R. Multiple roles of TNF super family members in corpus luteum function. Reprod Biol Endocrinol. 2003; 1:95.

    Article  PubMed  PubMed Central  Google Scholar 

  25. Pan W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics. 2002; 18:546–54.

    Article  CAS  PubMed  Google Scholar 

  26. Robinson MD, McCarthy DJ. Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26:139–40.

    Article  CAS  PubMed  Google Scholar 

  27. Romero JJ, Antoniazzi AQ, Smirnova NP, Webb BT, Yu F, Davis JS, et al. Pregnancy-associated genes contribute to antiluteolytic mechanisms in ovine corpus luteum. Physiol Genomics. 2013; 45(22):1095–1108.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004; 3(1):Article 3.

    Google Scholar 

  29. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-Seq data. BMC Bioinformatics. 2013; 14:91.

    Article  PubMed  PubMed Central  Google Scholar 

  30. Storey JD. A direct approach to false discovery rates. J R Stat Soc Ser B. 2002; 64:479–98.

    Article  Google Scholar 

  31. Tadesse MG, Ibrahim JG, Vannucci M, Gentleman R. Wavelet thresholding with Bayesian false discovery rate control. Biometrics. 2005; 61:25–35.

    Article  PubMed  Google Scholar 

  32. Tarazona S, García-Alcalde F, Dopazo J, Ferrer A, Conesa A. Differential expression in RNA-Seq: a matter of depth. Genome Res. 2011; 21:2213–223.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  33. Tusher VG, Ti bshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2011; 98:5116–121.

    Article  Google Scholar 

  34. Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, et al. The diploid genome sequence of an Asian individual. Nature. 2008; 456:60–65.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  35. Wilson EB, Hilferty MM. The distribution of chi-square. Proc Natl Acad Sci U S A. 1931; 17:684–88.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  36. Yu F, Chen M-H, Kuo L. Detecting differentially expressed genes using calibrated Bayes factors. Statistica Sinica. 2008; 18:783–802.

    Google Scholar 

  37. Zalman Y, Klipper E, Farberov S, Mondal M, Wee G, Folger JK. Regulation of Angiogenesis-Related Prostaglandin F2alpha-Induced Genes in the Bovine Corpus Luteum. Biology of Reproduction. 2012; 86(3):92.

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

We would like to thank the Editor and two referees for their very helpful comments and suggestions, which have led to an improved version of the paper. This work was supported by COPH Dean’s Mentored Research Grant at UNMC (FY), NIH grants #GM 70335 and #P01 CA142538 (MHC), Agriculture and Food Research Initiative (AFRI) Competitive Grant no. 2011-67015-20076 from the USDA National Institute of Food and Agriculture (NIFA) (JSD); the Department of Veterans Affairs (JSD), the Olson Center for Women’s Health (HT and JSD), and a AFRI NIFA Predoctoral Fellowship Award (HT).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Fang Yu.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

FY, MHC and LK developed the method, and carried out the simulation and real data analysis. HT and JD provided the real data and conducted the Ingenuity pathway analysis. All authors contributed to the writing, proof reading and approval of the final manuscript.

Additional files

Additional file 1

Methods. Mathematical Proof for Propositions 1 and 2.

Additional file 2

Real Data Analysis Results. Canonical Pathways identified by Methods I and II.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Yu, F., Chen, MH., Kuo, L. et al. Confident difference criterion: a new Bayesian differentially expressed gene selection algorithm with applications. BMC Bioinformatics 16, 245 (2015). https://doi.org/10.1186/s12859-015-0664-3

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12859-015-0664-3

Keywords