Open Access Highly Accessed Methodology article

A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNA-seq experiments

Mikel Esnaola16, Pedro Puig2, David Gonzalez3, Robert Castelo45* and Juan R Gonzalez1256*

Author Affiliations

1 Center for Research in Environmental Epidemiology (CREAL), Barcelona, Spain

2 Department of Mathematics, Universitat Autònoma de Barcelona (UAB), Barcelona, Spain

3 Center for Genomic Regulation (CRG), Barcelona, Spain

4 Department of Experimental and Health Sciences, Research Program on Biomedical Informatics (GRIB), Universitat Pompeu Fabra, Barcelona, Spain

5 Hospital del Mar Research Institute (IMIM), Barcelona, Spain

6 CIBER Epidemiology and Public Health (CIBERESP), Barcelona, Spain

For all author emails, please log on.

BMC Bioinformatics 2013, 14:254  doi:10.1186/1471-2105-14-254

Published: 21 August 2013

Abstract

Background

High-throughput RNA sequencing (RNA-seq) offers unprecedented power to capture the real dynamics of gene expression. Experimental designs with extensive biological replication present a unique opportunity to exploit this feature and distinguish expression profiles with higher resolution. RNA-seq data analysis methods so far have been mostly applied to data sets with few replicates and their default settings try to provide the best performance under this constraint. These methods are based on two well-known count data distributions: the Poisson and the negative binomial. The way to properly calibrate them with large RNA-seq data sets is not trivial for the non-expert bioinformatics user.

Results

Here we show that expression profiles produced by extensively-replicated RNA-seq experiments lead to a rich diversity of count data distributions beyond the Poisson and the negative binomial, such as Poisson-Inverse Gaussian or Pólya-Aeppli, which can be captured by a more general family of count data distributions called the Poisson-Tweedie. The flexibility of the Poisson-Tweedie family enables a direct fitting of emerging features of large expression profiles, such as heavy-tails or zero-inflation, without the need to alter a single configuration parameter. We provide a software package for R called tweeDEseq implementing a new test for differential expression based on the Poisson-Tweedie family. Using simulations on synthetic and real RNA-seq data we show that tweeDEseq yields P-values that are equally or more accurate than competing methods under different configuration parameters. By surveying the tiny fraction of sex-specific gene expression changes in human lymphoblastoid cell lines, we also show that tweeDEseq accurately detects differentially expressed genes in a real large RNA-seq data set with improved performance and reproducibility over the previously compared methodologies. Finally, we compared the results with those obtained from microarrays in order to check for reproducibility.

Conclusions

RNA-seq data with many replicates leads to a handful of count data distributions which can be accurately estimated with the statistical model illustrated in this paper. This method provides a better fit to the underlying biological variability; this may be critical when comparing groups of RNA-seq samples with markedly different count data distributions. The tweeDEseq package forms part of the Bioconductor project and it is available for download at http://www.bioconductor.org webcite.