Center for Statistical Sciences and Department of Community Health, Box G-121S-7, Brown University, Providence RI 02912, USA

Department of Cell and Molecular Biology, The University of Rhode Island, 120 Flagg Road, Kingston, RI 02881, USA

The Graduate School of Oceanography, University of Rhode Island, South Ferry Road, Narragansett, RI 02882, USA

Biology Department, Woods Hole Oceanographic Institution, Woods Hole MA 02543, USA

Marine Chemistry and Geochemistry Department, Woods Hole Oceanographic Institution, 360 Woods Hole Rd, Woods Hole MA 02543, USA

Abstract

Background

Recent technological advancements have made high throughput sequencing an increasingly popular approach for transcriptome analysis. Advantages of sequencing-based transcriptional profiling over microarrays have been reported, including lower technical variability. However, advances in technology do not remove biological variation between replicates, and this variation is neglected in many analyses.

Results

We propose an empirical Bayes method, titled Analysis of Sequence Counts (ASC), to detect differential expression based on sequencing technology. ASC borrows information across sequences to establish a prior distribution of sample variation, so that biological variation can be accounted for even when replicates are not available. Compared to current approaches that simply test for equality of proportions in two samples, ASC is less biased towards highly expressed sequences and can identify more genes with a greater log fold change at lower overall abundance.

Conclusions

ASC unifies the biological and statistical significance of differential expression by estimating the posterior mean of log fold change and estimating false discovery rates based on the posterior mean. The implementation in R is available at

Background

Recent technological advancements have made high throughput sequencing an increasingly popular approach for transcriptome analysis. Unlike microarrays, enumeration of transcript abundance with sequencing technology is based on direct counts of transcripts rather than relying on hybridization to probes. This has reduced the noise caused by cross-hybridization and the bias caused by the variation in probe binding efficiency. Sequencing-based transcription profiling does have other challenges. For example, whole transcript analysis produces data with transcript length bias

For illustration, we use data from Illumina Digital Gene Expression (DGE) tag profiling in this paper. However, our statistical methodology, and its implementation in R, are general for all sequencing-based technologies that quantify gene expression as counts instead of continuous measurements such as probe intensity in microarrays. In DGE, the 3' ends of transcripts with a poly-A tail are captured by beads coated with oligo dT. Two restriction enzymes, NlaIII and MmeI, are used to digest the captured transcripts, generating a 21-base fragment starting at the most 3' NlaIII site. The 21-base fragments are sequenced to quantify the transcriptome. Consider two samples in a comparison and let $x_1$ and $x_2$ be the counts of a particular sequence tag in the two samples. The most common approach is to consider the counts as a realization of a binomial distribution, $x_i \sim \mathrm{Binomial}(n_i, p_i)$, where $n_i$ is the total number of tags sequenced in sample $i$ and $p_i$ is the proportion of the tag in that sample, so that a test of $H_0: p_1 = p_2$ can be conducted. The classical Z-test using the Gaussian approximation to the binomial distribution has been proposed for Serial Analysis of Gene Expression (SAGE) data.
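The test of equal proportions described above can be sketched in a few lines. This is a minimal illustration, assuming the common pooled-variance form of the two-proportion Z-test; the function name and argument names are hypothetical, and the exact variant used in the SAGE literature may differ.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion Z-test (an assumed, textbook form).

    x1, x2: tag counts in the two samples; n1, n2: total tags sequenced.
    Returns (z, two-sided p-value) under the Gaussian approximation.
    """
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        # Both counts are zero (or both equal the totals): no evidence of a difference.
        return 0.0, 1.0
    z = (x1 / n1 - x2 / n2) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# A 10% change at high abundance versus a 4-fold change at low abundance.
z_high, p_high = two_proportion_z_test(11000, 10_000_000, 10000, 10_000_000)
z_low, p_low = two_proportion_z_test(8, 10_000_000, 2, 10_000_000)
```

In this made-up example the small change at high abundance yields a far smaller p-value than the 4-fold change at low abundance, which is the dependence of power on count size that motivates the approach in this paper.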

The test of $H_0: p_1 = p_2$ can be performed without replicates. However, rejection of $H_0$ simply implies a difference between the two samples. Unless the proportion of a gene in the transcriptome is the same for all samples under the same condition (no within-class variation), we cannot generalize the difference between two samples to a difference between two classes. The within-class biological variation among replicates leads to overdispersion in binomial or Poisson models. Models accounting for overdispersion, such as the beta-binomial, have been introduced for the analysis of SAGE data when several replicates within each class are available.

In this paper we present an empirical Bayes method, titled Analysis of Sequence Counts (ASC), to estimate the log fold change of transcription between two samples. We borrow information across sequences to estimate the hyperparameters representing the normal biological variation among replicates and the distribution of a transcriptome. The statistical model does not rely on a Gaussian approximation of the binomial distribution for all tags and requires no special treatment of 0 counts. Differential expression is computed in the form of a shrinkage estimate of log fold change. This estimate is the basis for ranking genes. We also compute the posterior probability that the log fold change is greater than a biologically relevant threshold chosen by the user. In contrast to sorting genes simply by p-values, we focus on biological significance (represented by the posterior expectation of log fold change) and provide an uncertainty measure in the form of a posterior probability.

Modeling biological variation

It has been reported that the noise in gene expression by sequencing depends on expression level as observed in microarray data

**Scatter plot of the log_{10} rpm in two samples**. A. Scatter plot of the log

**Quantile-quantile (QQ) plots of the differences of log rpm (log(p_{1}) - log(p_{2})) confirming the Gaussian distribution as a reasonable approximation of the biological variation**. A. QQ plots of log(

Distribution of expression levels in a transcriptome

As observed in both microarray data and sequencing-based transcriptome profiling, genes can differ by orders of magnitude in their expression levels, ranging from less than 1 per million to thousands per million and the majority of genes have relatively low counts. Tags with 0 counts cause problems in statistical analyses that take a direct log transformation and some investigators have had to develop special treatments for those genes _{10}(_{1 }∧ 0.5)_{1 }+ log_{10}(_{2 }∧ 0.5)_{2}]

**Histogram of the average log_{10} rpm between the A and B samples**. Histogram of the average log

Results and Discussion

We applied ASC to transcription profiles of the diatom

**Shrinkage estimate of log fold change**. Shrinkage estimate of log fold change from ASC plotted against apparent log fold change for all genes. The apparent log fold change is defined as

We estimated the posterior mean of log fold change and the posterior probability that there is a greater than two-fold change for a given tag. There are 1050 genes with posterior probability greater than 0.9 that the fold change is greater than 2. The average log rpm of those tags ranges from less than 0.23 (1.7 rpm) to 3.6 (10,000 rpm), and most are approximately 1 (10 rpm). Figure

**Differentially expressed genes identified by ASC**. A. Scatter plot of the log_{10} rpm in two samples. Differentially expressed genes with posterior probability of fold change greater than 2 are highlighted in red. B. Distribution of average rpm for the highlighted differentially expressed genes.

Comparison with other methods

All of the genes identified as differentially expressed by ASC have very small p-values if a simple test of equal proportions is performed. In fact, a simple Z-test identifies 3479 differentially expressed genes at significance level 0.05 with Bonferroni correction, as highlighted in red in Figure

**Differentially expressed genes identified by Z-test**. A. Scatter plot of the log_{10} rpm in the A and B samples. Differentially expressed genes with Bonferroni adjusted p-value less than 0.05 are highlighted in red, and the 1000 with the smallest p-values are highlighted in blue. B. Distribution of average rpm for the highlighted red or blue genes.

**Figure S1**. A. Scatter plot of the log_{10} rpm in the A and B samples. Differentially expressed genes identified by DGEseq with estimated q-value (Storey FDR) less than 0.01 are highlighted in red, and the 1000 genes with the smallest q-values are highlighted in blue. B. Distribution of average rpm for the highlighted red or blue genes.

ASC clearly prioritizes genes differently from the Z-test or DGEseq and finds more genes with modest expression but greater fold change as differentially expressed. To show that the top ranked genes in ASC are associated with higher biological significance, we obtained DGE data from an experiment comparing expression from two genotypes with 4 replicates each

Overlap between the top 1000 genes identified by different methods and the SAGE BetaBin ranking.

| Bayes error by SAGE BetaBin | ASC | DGEseq | Z-test | Fisher's exact test | edgeR |
| --- | --- | --- | --- | --- | --- |
| ≈ 0 | 320 | 189 | 178 | 180 | 259 |
| ≤ 0.01 | 391 | 254 | 242 | 244 | 332 |
| ≤ 0.05 | 516 | 404 | 397 | 398 | 436 |

The Bayes error is computed using SAGE BetaBin

We have also used edgeR

Overlap between the top 100 or top 1000 differentially expressed genes identified by edgeR on full data and by other statistics on data without replicates

| edgeR on full data | Without replicates: | | | | |
| --- | --- | --- | --- | --- | --- |
| | **ASC** | **edgeR** | **DGEseq** | **Z** | **Fisher** |
| top 100 | 33 | 37 | 2 | 1 | 1 |
| top 1000 | 260 | 263 | 89 | 82 | 86 |

Why is there so little overlap between the top genes by Z-test on a two-sample comparison and the top genes from edgeR analysis on the full data set? Strikingly, many genes with extreme p-values in a Z-test have small fold changes. This is because there is greater statistical power to detect even subtle changes in gene expression when the counts are higher. From the Gaussian approximation to the sample proportion, the log sample proportion is also approximately Gaussian. Because the expected count varies greatly, from a few to over a hundred thousand, and the variance of the log sample proportion decreases sharply as the expected count increases, statistical power is clearly biased towards genes with higher counts. This also causes the bias of higher power towards longer transcripts in full transcript analysis. An extreme p-value in such a test only suggests that the proportion of a transcript is significantly different between the two samples being compared, not whether the difference is beyond what is reasonable between biological replicates. Figure
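The dependence of power on the expected count can be made explicit with a standard delta-method argument. The derivation below is a sketch in assumed notation (a count $x \sim \mathrm{Binomial}(n, p)$ with sample proportion $\hat{p} = x/n$); the symbols are chosen for illustration and are not necessarily those of the original equations.

```latex
% Assumed notation: x ~ Binomial(n, p), \hat{p} = x / n, natural logarithm.
\begin{align*}
\hat{p} &\overset{\text{approx.}}{\sim} N\!\left(p,\ \frac{p(1-p)}{n}\right), \\
\operatorname{Var}(\log \hat{p}) &\approx \left(\frac{1}{p}\right)^{2} \frac{p(1-p)}{n} = \frac{1-p}{np}, \\
\log \hat{p} &\overset{\text{approx.}}{\sim} N\!\left(\log p,\ \frac{1-p}{np}\right)
  \approx N\!\left(\log p,\ \frac{1}{np}\right) \quad \text{for small } p .
\end{align*}
```

Under this approximation the variance of the log sample proportion is roughly the reciprocal of the expected count $np$: an expected count of 5 gives a standard deviation of about 0.45 on the natural log scale, whereas an expected count of 100,000 gives about 0.003.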

**Histogram of the apparent fold change of the top 1000 genes found by Z-test or ASC**.

**Figure S2**. Histogram of the apparent fold change of the top 1000 genes found by DGEseq or ASC.

Discussion

We present a simple hierarchical model for sequencing-based gene expression data (e.g., DGE, RNA-seq, etc.) that provides a shrinkage estimate of differential expression in the form of the posterior mean of log fold change. Even in experiments lacking replicates, we take advantage of the large number of sequences quantified in the same experiment to establish a prior distribution of the difference between conditions. The differential expression of a gene is evaluated based on the posterior expectation of log fold change. This estimate takes into account the increased uncertainty for genes with smaller counts (demonstrated by more aggressive shrinking in Figure

It is not uncommon to use hierarchical models for gene expression data. Several models used in microarray data analysis ^{2}) and _{0}. This essentially assumes that the prior distribution of

In biological terms, our model means that the mean gene expression levels between two populations are never exactly equal for any gene. However, the differences for most genes are small. We use the posterior expectation as the estimate of the magnitude of the difference. McCarthy and Smyth

Methods

DGE data generation

The diatom _{4}) and phosphorus-replete medium (36 _{4}) were grown in triplicate and are herein referred to as treatments A and B, respectively. Equal volumes of cell biomass from each replicate were pooled for the A or B treatments 96 hours after inoculation and harvested by gentle filtration. Filters were immediately frozen in liquid nitrogen and stored at -80°C.

Total RNA was extracted using the RNeasy Midi Kit (Qiagen), following the manufacturer's instructions with the following changes: RNA samples were processed with Qiashredder columns (Qiagen) to remove large cellular material, and DNA was removed with an on-column DNase digestion using RNase-free DNase (Qiagen). A second DNA removal step was conducted using the Turbo DNA-free kit (Ambion, Austin, TX, USA). The RNA was quantified in triplicate using the Mx3005 Quantitative PCR System (Stratagene) and the Quant-iT RiboGreen RNA Assay Kit (Invitrogen) and was analyzed for integrity by gel electrophoresis. Total RNA was sent to Illumina (Hayward, CA), where digital gene expression (DGE) libraries with NlaIII tags were constructed following the Illumina protocol and sequenced on the Illumina Genome Analyzer. A total of 12,525,833 tags were sequenced from the A library and 13,431,745 tags from the B library.

Hierarchical model for gene counts

For each transcript, we assume the observed sequence counts follow a Binomial distribution given its expected expression under a biological condition. For a sequencing run that yields total count _{1 }and _{2 }while

Many researchers simply test $H_0: p_1 = p_2$ and perform a Bonferroni correction to account for multiple testing. We reparametrize $p_1$ and $p_2$ as follows:

Here

We assume prior distributions

where _{0}.

The posterior distribution of the differential expression is therefore

We obtain the posterior mean $E[\Delta \mid \mathbf{x}]$ and the posterior probability $P(|\Delta| > \Delta_0 \mid \mathbf{x})$, where $\Delta_0$ is a user-defined effect size of biological significance. There is no closed-form expression for the posterior distribution, so we use numerical integration to evaluate the posterior mean and probability.
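The numerical integration step can be illustrated with a small grid-based sketch. It is not the authors' R implementation, and several pieces are assumptions made only for illustration: the parametrization log p1 = lam + delta/2 and log p2 = lam - delta/2 in natural logs, a zero-mean Gaussian prior on delta with standard deviation `sigma`, a prior on lam proportional to exp(-rate * lam) above a lower bound, the use of |delta| in the posterior probability, and all function names, parameter names and grid limits.

```python
import math

def log_lik(x, n, log_p):
    """Binomial log-likelihood kernel in log p (binomial coefficient dropped)."""
    p = math.exp(log_p)
    if p >= 1.0:
        return float("-inf")  # outside the valid range of a proportion
    return x * log_p + (n - x) * math.log1p(-p)

def posterior_delta(x1, n1, x2, n2, sigma, rate, delta0,
                    lam_range=(-20.0, -0.5), delta_range=(-6.0, 6.0), steps=300):
    """Midpoint-rule approximation of E[delta | x] and P(|delta| > delta0 | x).

    Assumed model (hypothetical, see lead-in): delta ~ N(0, sigma^2) and a
    prior on lam proportional to exp(-rate * lam) within lam_range.
    """
    lam_lo, lam_hi = lam_range
    d_lo, d_hi = delta_range
    lams = [lam_lo + (i + 0.5) * (lam_hi - lam_lo) / steps for i in range(steps)]
    deltas = [d_lo + (j + 0.5) * (d_hi - d_lo) / steps for j in range(steps)]

    # Unnormalized log posterior on the grid.
    log_w = []
    for d in deltas:
        log_prior_d = -0.5 * (d / sigma) ** 2
        log_w.append([log_prior_d - rate * lam
                      + log_lik(x1, n1, lam + d / 2.0)
                      + log_lik(x2, n2, lam - d / 2.0) for lam in lams])

    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(max(row) for row in log_w)
    total = mean_num = prob_num = 0.0
    for d, row in zip(deltas, log_w):
        w = sum(math.exp(v - peak) for v in row)  # marginal weight of this delta
        total += w
        mean_num += d * w
        if abs(d) > delta0:
            prob_num += w
    return mean_num / total, prob_num / total

# Same apparent 4-fold change at two abundance levels (made-up counts).
mean_high, prob_high = posterior_delta(40, 1_000_000, 10, 1_000_000,
                                       sigma=0.5, rate=1.0, delta0=math.log(2))
mean_low, prob_low = posterior_delta(4, 1_000_000, 1, 1_000_000,
                                     sigma=0.5, rate=1.0, delta0=math.log(2))
```

With these illustrative settings the estimate for the lower counts is pulled more strongly towards zero than the estimate for the higher counts, even though the apparent fold change is the same. A fixed grid of this size is only adequate for modest counts; for highly expressed tags the posterior is much narrower, and a finer or adaptive quadrature would be needed.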

Estimation of hyperparameters

The observed log rpm has a very skewed distribution, motivating us to use a distribution with exponential decay. However, the distribution is shifted relative to a standard exponential distribution, with an unknown lower bound. One advantage of the exponential distribution is the closed-form expression of its cumulative distribution function. For _{0}), _{1}, _{2}, we can obtain empirical quantiles and

gives estimates

We can also use the method of moments to estimate the rate without knowing the shift parameter, since the conditional expectation also has a closed form due to the memoryless property. For a given 0 <

Thus we can estimate _{1 }= 0.8 and _{2 }= 0.9. The posterior mean **x**] is not sensitive to the choice of _{1}, _{2 }(Additional file _{0 }does not affect the posterior distribution and does not need to be estimated.
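Both estimation ideas above follow from standard properties of a shifted exponential distribution and can be sketched as follows. The code assumes the quantity being modeled has density proportional to exp(-rate * (v - shift)) for v >= shift; the function names and the interpolation rule for empirical quantiles are illustrative choices, not taken from the original implementation.

```python
import math

def empirical_quantile(sorted_values, u):
    """Empirical quantile of pre-sorted data by linear interpolation, 0 <= u <= 1."""
    pos = u * (len(sorted_values) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac

def rate_from_quantiles(values, u1=0.8, u2=0.9):
    """Rate from two quantiles; the unknown shift cancels in the difference.

    For a shifted exponential, q(u) = shift - log(1 - u) / rate, so
    q(u2) - q(u1) = log((1 - u1) / (1 - u2)) / rate.
    """
    s = sorted(values)
    q1, q2 = empirical_quantile(s, u1), empirical_quantile(s, u2)
    return math.log((1.0 - u1) / (1.0 - u2)) / (q2 - q1)

def rate_from_tail_mean(values, u=0.8):
    """Method of moments on the tail: E[V - q | V > q] = 1 / rate (memoryless)."""
    s = sorted(values)
    q = empirical_quantile(s, u)
    excess = [v - q for v in s if v > q]
    return len(excess) / sum(excess)

# Deterministic check on synthetic data: inverse-CDF values on a uniform grid,
# with a made-up shift of -0.5 and rate of 2.0.
N = 10000
synthetic = [-0.5 - math.log(1.0 - (i + 0.5) / N) / 2.0 for i in range(N)]
rate_q = rate_from_quantiles(synthetic)
rate_m = rate_from_tail_mean(synthetic)
```

Neither estimator needs the lower bound of the distribution, which matches the remark in the text that the shift does not have to be estimated.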

**Figure S3**. Sensitivity of _{1 }= 0.8, _{2 }= 0.9 and _{1 }= .9, _{2 }= 0.95, respectively. The maximum difference in estimated fold change is less than 0.04, indicating that

To estimate _{1}) decreases with rate of 1/

Authors' contributions

ZW developed the statistical methodology, in consultation with BDJ and TAR, and drafted the manuscript. BDJ, TAR, STD and MAS designed the study that generated the DGE data, and contributed to writing the manuscript. MM and LPW performed experiments that generated the DGE data. All authors have read and approved the final manuscript.

Acknowledgements

We thank the reviewers for their insightful comments and suggestions that greatly strengthened the manuscript. We thank A. Drzewianowski for her assistance with laboratory experiments. Funding was provided by NSF OCE-0723677.