Abstract
Background
Highthroughput RNA sequencing (RNAseq) offers unprecedented power to capture the real dynamics of gene expression. Experimental designs with extensive biological replication present a unique opportunity to exploit this feature and distinguish expression profiles with higher resolution. RNAseq data analysis methods so far have been mostly applied to data sets with few replicates and their default settings try to provide the best performance under this constraint. These methods are based on two wellknown count data distributions: the Poisson and the negative binomial. The way to properly calibrate them with large RNAseq data sets is not trivial for the nonexpert bioinformatics user.
Results
Here we show that expression profiles produced by extensivelyreplicated RNAseq experiments lead to a rich diversity of count data distributions beyond the Poisson and the negative binomial, such as PoissonInverse Gaussian or PólyaAeppli, which can be captured by a more general family of count data distributions called the PoissonTweedie. The flexibility of the PoissonTweedie family enables a direct fitting of emerging features of large expression profiles, such as heavytails or zeroinflation, without the need to alter a single configuration parameter. We provide a software package for R called tweeDEseq implementing a new test for differential expression based on the PoissonTweedie family. Using simulations on synthetic and real RNAseq data we show that tweeDEseq yields Pvalues that are equally or more accurate than competing methods under different configuration parameters. By surveying the tiny fraction of sexspecific gene expression changes in human lymphoblastoid cell lines, we also show that tweeDEseq accurately detects differentially expressed genes in a real large RNAseq data set with improved performance and reproducibility over the previously compared methodologies. Finally, we compared the results with those obtained from microarrays in order to check for reproducibility.
Conclusions
RNAseq data with many replicates leads to a handful of count data distributions which can be accurately estimated with the statistical model illustrated in this paper. This method provides a better fit to the underlying biological variability; this may be critical when comparing groups of RNAseq samples with markedly different count data distributions. The tweeDEseq package forms part of the Bioconductor project and it is available for download at http://www.bioconductor.org webcite.
Background
Highthroughput gene expression profiling across samples constitutes one of the primary tools for characterizing phenotypes at molecular level. One of the main advantages of the rapidly evolving massive scale cDNA sequencing assay for this purpose (RNAseq [1]), over the hybridizationbased microarray technology, is a much larger dynamic range of detection. However, the extent to which this feature is fully exploited depends entirely on the way the resulting data is analyzed when addressing a particular biological question. For instance, in the identification of genes that significantly change their expression levels between groups of samples, also known as differential expression (DE).
For DE analysis, after some preprocessing steps that include the alignment of the sequenced reads to a reference genome and their summarization into features of interest (e.g., genes), raw RNAseq data is transformed into an initial table of counts. This table should then be normalized [24] in order to adjust for both technical variability and the expression properties of the samples, such that the estimated normalization factors and offsets applied to the RNAseq count data describe as accurately as possible the relative number of copies of each feature throughout every sample. As opposed to the continuous nature of logscale fluorescence units in microarray data, RNAseq expression levels are defined by discrete count data, and therefore, require specific DE analysis techniques.
Detection of DE genes using RNAseq data was firstly based on using models assuming a Poisson distribution [5] with one single parameter, the mean, which simultaneously determines the variance of the distribution. Given that the observed variation in read counts is much larger than the mean (overdispersion), researchers have proposed the use of negative binomial (NB) distributions [68] which are defined by two parameters: the mean and the dispersion. However, the larger power of RNAseq to capture biological variability can potentially introduce into count data not only overdispersion, but also oddities such as zeroinflation (i.e., in lowly expressed genes, the proportion of zero counts may be greater than expected under an NB distribution) and heavy tail behavior (i.e., a large dynamic range within the same expression profile), specially when many biological replicates are available. Under these circumstances even a twoparameter NB distribution may not provide an adequate fit to the data (see Figure 1). In turn, this may lead to incorrect statistical inferences resulting in lists of DE genes with a potentially increased number of false positive calls and poor reproducibility. To overcome this problem, methods based on the NB distribution [611] use sophisticated moderation techniques that borrow information across genes and exploit the meanvariance relationship in count data to improve the estimation of the NB dispersion parameter. This requires, however, that the parameter configuration is calibrated for the most appropriate moderation regime which may depend on features such as sample size, the magnitude of the foldchange, the variability of expression levels, the fraction of genes undergoing differential expression and the overall expression level.
Figure 1. Fit of different count data distributions to diverse RNAseq gene expression profiles. Fit of different count data distributions to the female (a, c, e) and male (b, d, f) RNAseq expression profiles of genes EEF1A2(a, b), SCT(c, d) and NLGN4Y(e, f). All plots show the empirical cumulative distribution function (CDF) of counts (black dots) and the CDF estimated by a pure negative binomial model (black dashed line), a PoissonTweedie model (red line) obtained with tweeDEseq and several moderated negative binomial models obtained with different parameter configurations of DESeq and edgeR. Estimated dispersions, and shape in the case of tweeDEseq, are indicated in the legend. Above the legend, the Pvalue of the test of goodnessoffit to a negative binomial distribution is shown. According to this test, expression profiles in panels (a, b, c and e) do not follow a negative binomial distribution. Female samples display nonnegative binomial features such as a heavytail (a, c) and zeroinflation (c, e). Gene NLGN4Y is documented in the literature as a gene with sexspecific expression, while the other two are not (EEF1A2 is a housekeeping gene and SCT is an endocrine hormone peptide in chromosome 11 that controls secretions in the duodenum).
In this paper we propose to approach this problem by using other count data distributions that fit expression profiles better than the NB without the need to alter configuration parameters. The rest of the paper is organized as follows. Using a large RNAseq data set of HapMap lymphoblastoid cell lines (LCLs) derived from n=69 unrelated Nigerian (YRI) individuals [12], we start by assessing the goodness of fit of extensively replicated expression profiles to the NB distribution, showing a lack of fit for an important fraction of genes. We illustrate how a more flexible family of countdata probability distributions, called the PoissonTweedie, provides a better fit to these expression profiles. We provide data supporting the hypothesis that the lack of fit to NB distributions may be related to the dynamics of gene expression unveiled by RNAseq technology. We then introduce a new test for differential expression analysis in RNAseq data based on the PoissonTweedie family of distributions. We demonstrate with simulations on synthetic and real RNAseq data how a single run of our approach provides Pvalues that are equally or more accurate than NBbased competing methods calibrated with a variety of configuration parameters. Finally, by surveying the tiny fraction of sexspecific gene expression changes in LCL samples, we approach the problem of assessing accuracy in DE analysis with real RNAseq data and show that, in the context of extensively replicated RNAseq experiments, tweeDEseq yields better performance than competing NBbased methods without the need to make an informed decision on the configuration of parameters. This improvement is reported in terms of precision and recall of DE genes and reproducibility of the significance levels with respect to matching microarray experiments.
Results and discussion
The results we provide in this paper are based on data from a previously published large RNAseq experiment [12] and on our own simulated count data. We downloaded and preprocessed the HapMap LCL raw RNAseq data, consisting of n=69 samples from unrelated YRI individuals, with our own pipeline (see Methods). The resulting table of counts consists of 38,415 genes by 69 samples. We filtered out genes with very low expression levels and used different normalization methods [2,4] (see Methods) to ensure that the results described below do not depend on this fundamental step. In fact, we have observed that normalized counts can lead to quite different MAplots depending on the normalization method, thus potentially affecting DE detection power and accuracy (Figure 2).
Figure 2. Count data normalization. MAplots of the count data corresponding to the YRI samples from Pickrell [12]et al. (2010) after applying the following normalization methods: (a) raw count data without any normalization; (b) normalization with the edgeR[2] package; and (c) normalization with the cqn[4] package. The xaxis (A) shows the average expression throughout female and male samples in log2 scale and the yaxis (M) shows the magnitude of the log2fold change between female and male samples.
The statistical methods proposed in this paper are implemented in a package for the statistical software R, called tweeDEseq which forms part of the Bioconductor project [13] at http://www.bioconductor.org webcite. We have also created an experimental data package, called tweeDEseqCountData, which contains the previously described data set and is also available at the same URL. All results presented in the paper were obtained using these and other packages from R version 2.15.1 and Bioconductor version 2.11, and can be reproduced through the scripts available as Additional file 1 to this article.
Additional file 1. Scripts. ZIP file (.zip) containing all scripts, in the form of Sweave vignettes, to reproduce the results shown in this paper, including one copy of the resulting PDF file. Please read first through the README file contained in this tar ball in order to understand how to run the scripts.
Format: ZIP Size: 4.4MB Download file
Review of competing methods
There is currently a large body of literature on DE analysis methods for RNAseq data [511,14], nearly all of them based on the NB distribution and developed to deliver their best performance with few replicates. Anders et al. (2010) [7] argued that for large number of individuals “... questions of data distribution could be avoided by using nonparametric methods, such as rankbased permutation tests”. However, rankbased methods require similar count data distributions between sample groups. Due to the large variability across groups [15] captured by RNAseq data, this assumption will most likely be broken in this context. For example, panels ef in Figure 1 illustrate the case of gene NLGN4Y (ENSG00000165246), a gene located in the malespecific region of chromosome Y and reported to have sexspecific expression, which shows remarkably different count data distributions between male and female samples. Permutation tests are also underpowered since distribution tails are not well estimated (due to the large dynamic range), which is important when correcting for multiple testing.
In this paper we will focus our comparisons on the two most widely used methods for DE analysis of RNAseq data, edgeR[6,8,10] (version 3.0.8) and DESeq[7] (version 1.10.1) and explore those parameter configurations in these methods that we found most relevant for large RNAseq data sets, according to the available documentation. Both, edgeR and DESEq, assume that expression profiles from RNAseq data follow an NB distribution and borrow information across genes to first estimate a common dispersion parameter. Then, for each gene, they estimate its genewise dispersion and moderate it towards the common one. The way in which this moderation takes place depends on the method and its configuration parameters. DESeq[7] allows switching between common (sharingMode=~fitonly~) and genewise (sharingMode=~geneestonly~) dispersions. It provides a straightforward strategy (sharingMode=~maximum~, default configuration) to choose between common and genewise dispersions by taking the largest value for each gene. edgeR allows one to calibrate, using the prior.df parameter, the transition from a purely genewise dispersion estimate (small values of prior.df) to the common one (large values of prior.df) by using an empirical Bayes approach. By default prior.df=20 which implies that a large weight is given to the common dispersion. However, according to the documentation, if the number of samples is large, the common dispersion becomes less important in the moderation step. Additional options in DESeq and edgeR that may be relevant in the context of large RNAseq data sets are, in the case of DESeq, whether dispersions are estimated from the entire pool of samples (method=~pooled~, its default) or separately per sample group (method=~percondition~). In the case of edgeR, whether the DE test is performed using a likelihood ratio test (glmLRT() function) or a quasilikelihood Ftest [8] (glmQLFTest() function), after dispersions are estimated. Table 1 summarizes these eight combinations of methods and parameter configurations and contains the key to the terms used in some figures to distinguish among them.
Table 1. Methods and parameter configurations compared in this paper
Different gene expression dynamics require different distributional assumptions on count data
We assessed the goodnessoffit of every expression profile in the LCL RNAseq data to an NB distribution (see Methods) by means of quantilequantile (QQ) plots (Figure 3) and about 10% of the genes show a substantial discrepancy with respect to the NB distribution in the counts (see right yaxis in Figure 3). Such a discrepancy is absent from data simulated from NB distributions with a similar number of genes including a small fraction of them changing between two conditions (Additional file 2: Figure S1).
Figure 3. Goodness of fit to the negativebinomial distribution. Quantilequantile (QQ) plots of the goodnessoffit of RNAseq expression profiles from Pickrell [12]et al. (2010) to a negativebinomial (NB) distribution. The right yaxis indicates the quantile of the observed distribution. Columns correspond to different normalization methods where (a, d) correspond to raw unnormalized counts, (b, e) normalization with edgeR and (c, f) normalization with cqn. The top row (a, b, c) contains the QQ plots of the χ^{2} goodnessoffit statistic while the bottom row (d, e, f) contains the same QQ plot mapped to a normalized Zstatistic to improve the visibility of the left tail of the distribution. Independently on how count data are normalized, about 10% of the expression profiles show a substantial discrepancy to the NB distribution.
Additional file 2. Supplementary materials. PDF file including supplementary figures and tables.
Format: PDF Size: 2.1MB Download file
This file can be viewed with: Adobe Acrobat Reader
This result suggests that NB distributions may be too restrictive for an important fraction of expression profiles in large RNAseq data sets. Among the possible causes underlying the lack of fit of those genes to an NB distribution, a clear candidate is that the presence of many samples can potentially introduce features such as zeroinflation or heavytails (see Figure 1). So far, extensive biological replication in RNAseq experiments has been the exception rather than the rule. However, it is becoming increasingly clear [15] that in the coming years larger RNAseq data sets will be required to justify scientific conclusions and provide reproducible results. Therefore, we can expect to see more often gene expression profiles with emerging features, such as zeroinflation and heavy tails, that challenge RNAseq methods based on the NB distribution.
We propose to address this problem by adopting the PoissonTweedie (PT) family of distributions [16] to model RNAseq count data directly. PT distributions are described by a mean (μ), a dispersion (ϕ) and a shape (a) parameter (see Methods) and include Poisson and NB distributions, among others, as particular cases [16]. An important feature of this family is that, while the NB distribution only allows a quadratic meanvariance relationship, the PT distributions generalizes this relationship to any order [17]. We have implemented a maximum likelihood procedure for the estimation and simulation of these parameters from count data. These procedures are available in the tweeDEseq package through the functions mlePoissonTweedie(), dPT() and rPT().
Figure 1 illustrates the flexibility of the PT distribution to accurately fit different gene expression profiles obtained from the unnormalized LCL RNAseq data set. Left and right panels correspond to female and male samples, respectively and each row corresponds to a different gene: EEF1A2 (ENSG00000101210), SCT (ENSG00000070031) and NLGN4Y (ENSG00000165246), respectively. Among these three genes, only NLGN4Y has been reported in the literature to have sex specific expresssion, while the other two are likely to lack such property since EEF1A2 is a housekeeping gene and SCT is an endocrine hormone peptide in chromosome 11 that controls secretions in the duodenum. Each plot shows the empirical cumulative distribution of observed counts as well as the parametric cumulative distributions obtained through the estimation of parameters of the methods compared in this paper under different configurations. Note that the estimated dispersion parameter ϕ is identical between the two sample groups for edgeR and DESeq (pooled) as these approaches estimate ϕ irrespective from the sample groups. The Pvalue for testing whether the data follow an NB distribution (H_{0}:a=0), indicated above the legend, reveals that in several sample groups (panels ac, e) this hypothesis is rejected (P<0.05). In those cases, methods based on the NB distribution produce dispersion parameters that do not fit the data as accurately as the PT distribution. More concretely, heavytails present in panels a,c severely hamper the estimation of the pure NB and the common dispersion. These can be improved using a parameter configuration more suited to large sample sizes. However, this results in a poor estimate of zeroinflation in panels ce.
The main difference between the PT and NB distributions lies in the additional “shape” parameter a of the PT distribution which provides further flexibility (see Methods). Using the LCL data processed with different normalization methods, we show in Figure 4 all values of the shape parameter a for every gene as function of its mean expression level, illustrating the huge variability of this parameter in RNAseq count data. This wide range of values involves distinct possible distributional assumptions [16] beyond Poisson and NB, such as PoissonInverse Gaussian, PólyaAeppli or Neyman type A. Similarly to the MAplots of Figure 2, the cqn normalization method seems to make the largest impact on count data and, in this case, on the shape parameter.
Figure 4. Distribution of the PoissonTweedie shape parameter as function of the mean expression level. Estimated PoissonTweedie shape parameter a as function of the mean expression level for each gene. Red dashed lines indicate the value of a corresponding to each specific distribution within the PoissonTweedie family, denoted by Pois (Poisson), PIG (PoissonInverse Gaussian), NB (negative binomial), PA (PólyaAeppli) and NtA (Neyman type A). The right yaxis indicates the percentage of genes around specific a values bounded by dotted grey lines. Data from Pickrell [12]et al. (2010) are shown without any normalization (a), normalized with edgeR[2](b), and normalized with cqn[4](c).
We have investigated whether this diversity of count distributions underlying RNAseq data is related to different expression dynamics in genes. Using the test for the goodness of fit to an NB distribution (see Methods) we have considered as NB those genes that do not reject the null hypothesis at P>0.2 and as clearcut nonNB genes those with P<2^{−16}. By mapping all these genes to the Gene Expression Barcode catalog [18] (see Methods) we obtained an independent and unbiased estimation of their expression breadth. The results in Figure 5 suggest that the expression breadth of nonNB genes approaches that of housekeeping genes closer than NB genes do, irrespective of the normalization method.
Figure 5. Expression dynamics of genes with different count data distributions. Empirical cumulative distributions of the breadth of expression estimated through the Barcode [18] database, for genes that do not reject the null hypothesis of a negativebinomial (NB) distribution in a test for the goodness of fit at P>0.2 (green lines), genes that do reject such a null hypothesis at P<2^{−16} (blue lines) and housekeeping genes retrieved from literature [19] (red lines). Data from Pickrell [12]et al. (2010) are shown without any normalization (a), normalized with edgeR[2](b) and normalized with cqn[4](c). These plots show that, independently of the normalization method, nonNB genes at such significance level of discrepancy with respect to the NB distribution approach closer the expression dynamics of housekeeping genes than genes with expression profiles following the NB distribution.
In fact, Fisher’s exact tests for enrichment of nonNB genes among human housekeeping genes are significant (P<1.24^{−6}) for every normalization method (see Additional file 2: Table S1). These observations suggest that genes with different expression dynamics can produce different count data distributions, and underscore the flexibility of the PT statistical model to capture these dynamics revealed by extensivelyreplicated RNAseq experiments.
Accurately testing differential expression
For the purpose of a DE analysis between two groups of samples, we have developed a twosample PTtest for differences in means (see Methods) implemented through the function tweeDE() in the tweeDEseq package. We will assess the accuracy of this PTbased test using the LCL data as well as synthetic count data from two different simulation studies. The first simulation study with synthetic data provides an assessment of the typeI error rate under four different scenarios involving distinct count data distributions between sample groups (see Additional file 2: Table S2 for a description of them). Here we compare tweeDEseq with the configurations of edgeR and DESeq which are closer to a straightforward NB model. Additional file 2: Figures S2 to S5 show that tweeDEseq properly controls the nominal probability of a typeI error while edgeR, DESeq and nonparametric tests (U MannWithney and permutation) fail to do so when data are not simulated from NB distributions. As expected, these methods perform correctly when data are generated under an NB model (see Additional file 2: Figure S5) as expected. Additional file 2: Figure S6 also shows that in the calculation of very low Pvalues, tweeDEseq clearly outperforms permutations tests. In order to provide a practical recommendation on the minimum sample size required by tweeDEseq to yield accurate results we have estimated the probability of a typeI error across different sample sizes. Additional file 2: Figure S7 indicates that 15 samples per group should be sufficient for tweeDEseq to correctly control for a nominal significance level α=0.05.
In the second simulation study we have first assessed the accuracy of the Pvalue distribution under the null hypothesis of no differential expression with real RNAseq data by making repeatedly twosample group comparisons within males and within females samples such that we recreate the null hypothesis of no DE with real RNAseq data and no DE gene should be expected to be found. The raw Pvalue distributions from such analysis should ideally be uniform.
We have formally tested this hypothesis for every gene by means of a KolmogorovSmirnov (KS) goodnessoffit test to a uniform distribution and examine the resulting Pvalue distribution by means of QQ plots displayed in Figure 6. Under the null hypothesis that all genes are not DE, the KS Pvalues should lie along the diagonal of the QQ plot. The figure, however, shows large discrepancies to this criterion by some of the methods and configuration parameters, indicating that they may not be adequate for large RNAseq data sets.
Figure 6. Quantilequantile (QQ) plots for the goodnessoffit of nullhypothesis Pvalues to an uniform distribution. Using the results displayed in Additional file 2: Figure S8 and performing as described by Leek et al. (2007) [20], for each gene, the distribution of Pvalues throughout the 100 simulations was tested for its goodnessoffit to an uniform distribution using a KolmogorovSmirnov test. QQ plots in this figure show for all genes the resulting Pvalues of the previous test which, under the null hypothesis of no differential expression, should be uniformly distributed too and lead to lines lying on the diagonal. Panels ab show results from female vs female comparisons and cd from male vs male comparisons, while a,c correspond to unnormalized data and b,d to data normalized with the cqn[4]. The method introduced in this paper, tweeDEseq, is on average closer to the diagonal throughout the four simulations, closely followed by DESeq when sharingMode=~geneestonly~ and either method=~percondition~ or method=~pooled~.
The method introduced in this paper, tweeDEseq, is consistently closer to the diagonal than every other method throughout the two male and female comparisons and the two normalization methods. More informally, the visual inspection of the histograms of raw Pvalues given in Additional file 2: Figure S8 also reveals that tweeDEseq provides Pvalue distributions closer to the uniform under the null hypothesis of no DE simulated from extensively replicated real RNAseq data.
As other authors have shown, in the context of analysis of RNAseq data with very limited sample size [8], small deviations from uniformity of Pvalues under the null hypothesis can substantially affect FDR estimates of DE genes. We have also assessed the calibration of Pvalues and false discovery rates (FDR) with synthetic count data of similar dimensions to the RNAseq LCL data set, concretely with p=20,000 genes and n=70 samples. Working with this type of data allows to assess FDR estimates for a known subset of DE genes under a variety of simulated scenarios, which we defined by considering the combination of three different amounts of DE genes (100, 1000 and 2000) and three different magnitudes of foldchange (1.5, 2 and 4fold). Similarly to [8], data were simulated from a hierarchical gammaPoisson model with and without simulated library factors (see Methods).
From every simulated data set, raw Pvalues for the twosample DE test were obtained with each method and configuration parameters. Using the qvalue Bioconductor package [21] we estimated qvalues and the fraction of DE genes from each Pvalue distribution. Qvalues provide a nominal estimation of the FDR for each gene and in Figures 7 and 8 we show the empirical FDR (eFDR) as a function of the nominal qvalues for the simulations with constant and variable library factors, respectively. The dashed diagonal line indicates a correct calibration of Pvalues whose nominal FDR equals the observed eFDR. Lines above the diagonal correspond to liberal DE analysis methodologies and below to conservative ones.
Figure 7. Empirical FDR values for simulated data with constant library factors. Empirical FDR values on the yaxis as function of nominal qvalues on the xaxis calculated from data simulated with p=20,000 genes, n=70 samples and constant library factors. Each row and column corresponds, respectively, to a different number of DE genes and magnitude of the foldchange. The method introduced in this paper, tweeDEseq, is consistently closer to the diagonal than other methods throughout the different simulations.
Figure 8. Empirical FDR values for simulated data with variable library factors. Empirical FDR values on the yaxis as function of nominal qvalues on the xaxis calculated from data simulated with p=20,000 genes, n=70 samples and variable library factors. Each row and column corresponds, respectively, to a different number of DE genes and magnitude of the foldchange. The method introduced in this paper, tweeDEseq, is consistently closer to the diagonal than other methods throughout the different simulations.
To facilitate the comparison of methods across all simulated data sets we have calculated the mean squared error (MSE) between the eFDR and the nominal FDR and ranked the methods by increasing MSE. In Tables 2 and 3 we can find the MSE values and in Tables 4 and 5 the corresponding ranks of the methods according to the MSE values. As it follows from the rankings in Tables 4 and 5, tweeDEseq provides the best calibrated Pvalues in most of the simulated data sets.
Table 2. Mean squared error of false discovery rates under constant library factors
Table 3. Mean squared error of false discovery rates under variable library factors
Table 4. Rankings of methods by the mean squared error of false discovery rates under constant library factors
Table 5. Rankings of methods by the mean squared error of false discovery rates under variable library factors
The previous calculations of qvalues with the qvalue package [21] provide us also with estimates of the fraction of genes under the null hypothesis of no differential expression. This, in turn, allows one to derive an estimated number of DE genes as with p being the total number of genes. In principle, more precise Pvalues both under the null and the alternative hypotheses should provide more accurate estimates of the number of DE genes. We show such an assessment for the previous simulations in Additional file 2: Figures S9 and S10. To summarize those results we have divided each estimate of the number of DE genes by their actual simulated number of DE genes and aggregate those ratios throughout the different simulation scenarios to ease the comparison among the methods. We find this comparison in Figure 9 and it follows that tweeDEseq produces Pvalues that lead to the most accurate estimation of the number of DE genes, closely followed by edgeR with prior.df=1 when library factors are not held constant. In both settings, DESeq leads to extremely conservative estimates of the number of DE genes.
Figure 9. Estimation of the number of differentially expressed (DE) genes from simulated data. Boxplots of ratios of estimated to true numbers of DE genes obtained from data simulated from a hierarchical gammaPoisson model with constant (a) and variable (b) library factors. This figure summarizes the results in Additional file 2: Figures S9 and S10 reporting estimated numbers of DE genes under different simulated scenarios of number or true DE genes and foldchange. The horizontal dash line at ratio one corresponds to the best performance where the estimated number of DE genes matches the true number.
Identification of sexspecific gene expression in lymphoblastoid cell lines
Assessing performance of DE analysis methods without using simulated data is a challenging problem due to the difficulty of knowing or ensuring the exact differential concentration of RNA molecules in the analysed samples. In this respect, sexspecific expression constitutes a useful system to assess the accuracy of DE detection methods due to the vast literature on genes contributing to genderspecific traits. For this reason, in order to illustrate the accuracy of tweeDEseq with real RNAseq data, we have searched for genes changing significantly their expression between female and male individuals of the RNAseq experiments on LCLs analyzed in this paper. Again, we have compared different normalization procedures and parameter configurations of edgeR and DESeq. Next to considering the raw unnormalized data and the data normalized with cqn, TMM normalization was used for edgeR and tweeDEseq, while DESeq was used with its own normalization method. We have used a single significance cutoff of FDR <0.1 at which genes were called DE. Since LCLs come from a nonsexually dimorphic tissue and are outside their original biological context, the fraction of sexspecific expression changes we could expect should be rather small.
In an attempt to verify the accuracy of these lists of DE genes between female and male individuals, we searched for genes reported in the literature to be potential contributors to sexually dimorphic traits. This list of genes with documented sexspecific expression was obtained from genes in chromosome X that escape Xinactivation [22] and from genes in the malespecific region of the Y chromosome [23] (see Methods). This resulted in a goldstandard set of 95 genes mapping to Ensembl Gene Identifiers (release 63), which we shall denote by XiE and MSY genes, depending on their origin. For every predicted set of DE genes by each combination of DE detection method and normalized data set, we calculated precision and recall with respect to the goldstandard, and the Fmeasure which summarizes the tradeoff between these two diagnostics.
In Figure 10 we can see that tweeDEseq provides better performance than the other competing methods under different parameter configurations. The improvement is small with respect to the second bestperforming method and parameter configuration but we would like to emphasize that tweeDEseq does not require any informed decision on a parameter configuration, as opposed to edgeR and DESeq. To assess the robustness of this figure, we have run this comparative assessment with a more stringent filter on lowly expressed genes and, as Additional file 2: Figure S11 shows, tweeDEseq keeps performing better than the other methods, this time however only when data are normalized.
Figure 10. Precision and recall comparison on the LCL RNAseq data. Precision (yaxis) and recall (xaxis) values for genes called DE at 1% FDR by different DE detection methods and configuration parameters. The right yaxis indicates values of the Fmeasure shown by dot lines. As the figure shows, tweeDEseq provides higher Fmeasure values than other methods indicating a better precisionrecall tradeoff.
In Additional file 2: Table S3 we report the complete list of 55 DE genes detected by tweeDEseq from the data normalized with cqn, which is when it yields the best precisionrecall tradeoff. More than a half of genes in this list (32) are located in either the X or Y chromosomes and where the first 10 with largest foldchange contain 7 from the goldstandard set of MSY and XiE genes. Among the other 3, we find TTTY15, a testisspecific noncoding transcript from the Y chromosome and the other two lack functional annotation in Ensembl release 63.
Reproducibility with respect to microarray data
The YRI LCL samples we have analyzed have been previously assayed using microarray chips [24] and this enables a comparison between the gene expression read out of both technologies. In particular, we wanted to assess the degree of reproducibility of the significance levels of DE. While there may be many aspects from both technologies that can potentially bound the extent to which we can reproduce rankings of DE, we postulate that more accurate Pvalues in DE genes should lead to higher reproducibility of significance levels of DE genes.
With this purpose, we applied limma[25] on the microarray data and called genes DE at 10% FDR, just as we did with RNAseq data, and then compared the − log10 units of the raw Pvalues from DE genes called in RNAseq by each DE detection method to the − log10Pvalue units from genes called DE by limma. In Additional file 2: Figure S12 we show this comparison for every gene that is called DE either by limma in microarray data or by the other compared method in RNAseq data. Although we can observe a significant linear relationship between Pvalues in every compared method, the low fraction of variability explained by the fitted linear model (R^{2}<0.25) in every of those comparisons indicates a rather poor level of reproducibility for every method.
A closer look to genes in that figure indicates that the lack of reproducibility mostly comes from genes called DE by one method and technology but not by the other (dots close to zero in either the x or the yaxis). There may be many reasons, unrelated to the DE detection method itself, why a gene is not called simultaneously DE in two completely independent RNAseq and microarray experiments on the same biological material, such as different isoforms being probed in the microarray and summarized in RNAseq or differences in sample preparation. Therefore, for our current goal of assessing reproducibility of DE detection methods, we believe it makes sense to restrict this comparison to those genes that are called DE by both, limma in microrray data and the corresponding method in RNAseq data.
We can find this restricted comparison in Figure 11 which reveals that in this case only tweeDEseq attains a significant (P<0.05) linear fit with respect to the Pvalues from limma with a level of reproduciblity (R^{2}=0.6) substantially larger (46% increase) than the second best method (DESeq  PgO) with R^{2}=0.41.
Figure 11. Reproducibility of differential expression (DE) between microarray and RNAseq. Raw Pvalues of differential expression in − log10 scale for DE genes called at 10% FDR by both, limma (yaxis), from microarray data, and the other compared DE detection method applied on RNAseq data (xaxis). A regression line is depicted in red. On the bottomright corner of each panel, ρ indicates the Pearson correlation whereas R^{2} and P indicate, respectively, the coefficient of determination and Pvalue of the test for zero regression coefficient, of the − log10pvalues of limma as function of those from the compared RNAseq method. Only tweeDEseq provides a significant (p<0.05) level of reproducibility between Pvalues of DE genes reported by both, limma on microarray data and the compared RNAseq method, attaining also the highest ρ and R^{2} values. Blue dots indicate genes with documented sexspecific expression.
Finally, we have carried out a comparison between the entire output of DE genes obtained with tweeDEseq in RNAseq data with the entire output DE genes obtained with limma in microarray data. In Figure 12 we show the resulting volcano plots where we have highlighted with black dots those genes that are exclusively profiled by each technology. As the figure suggests, many more of these genes occur in RNAseq than in microrray, one remarkable case being the XIST gene which shows the largest foldchange and significance level and corresponds to the Xinactive specific noncoding RNA gene which acts as one of the key regulators in silencing one of the copies of chromosome X in females. Blue and red circles denote MSY and XiE genes, respectively. As expected, all MSY and XiE DE genes report significantly higher expression in males and females, respectively, except for the XiE gene NLGN4X in RNAseq, likely due to low expression from the inactive X chromosome in female samples [26]. Surprisingly the volcano plots show that limma on this microarray data set is able to detect a few more such genes than tweeDEseq on RNAseq data. Last, but not least, an important difference between the volcano plots of Figure 12 is the fact that expression changes larger than 2fold in these microarray data are nearly synonymous of statistical significance while with RNAseq a sizeable fraction of genes with 2fold or larger changes show very poor significance levels. This is likely due to the larger variability of gene expression measurements in RNAseq experiments with many samples and underscores the importance of using methods that properly assess the statistical significance of the observed changes.
Figure 12. Comparison of DE analyses between microarray and RNAseq. Volcano plots of DE analyses performed on matching LCL samples profiled with RNAseq (a) and gene expression microarrays (b). The xaxis corresponds to log2 foldchanges between female and male individuals while the yaxis corresponds to − log10pvalue of significance. RNAseq data were analysed with tweeDEseq while microarray data were analysed with limma. Grey dots indicate genes common to both, the RNAseq and the microarray gene expression matrices, while black dots indicate genes occurring exclusively in one of the two data sets. Blue and red circles indicate genes documented in the literature with sexspecific expression, concretely belonging to the malespecific region of chromosome Y and escaping Xchromosome inactivation in females, respectively.
Conclusions
The increased amount of biological variability revealed by extensive replication in RNAseq experiments brings new challenges to the task of identifying genes whose change in expression is both, biologically and statistically significant. In microarray data, large foldchanges derived from large data sets were nearly synonymous of statistical significance. The volcano plots in Figure 12 and the examples of specific genes in Figure 1 illustrate why this is not true anymore with RNAseq count data. Those figures unveil that one of these new challenges is to distinguish statistically significant changes among those that are already large in magnitude. In this paper we provide an approach to this problem by using the PT family of distributions, showing that it captures a much richer diversity of expression dynamics in RNAseq count data than the statistical models based in the NB distributions alone (see Figures 4 and 5). We have implemented a twosample PTtest in a software package for R, called tweeDEseq, for detecting DE genes and demonstrated with simulations that produces more accurate Pvalue distributions that lead to better calibrated qvalues and FDR estimates.
We have made an attempt to assess DE detection accuracy with real RNAseq data by comparing male and female LCL samples normalized with three different methods and comparing the results to a goldstandard set of genes with documented sexspecific expression. This assessment also shows that tweeDEseq provides a better precisionrecall tradeoff than the compared NBbased methods (see Figure 10 and Additional file 2: Figure S11). We have also made a comparison with matching samples hybridised on microarray chips which allowed us to verify that tweeDEseq yields a higher degree of reproducibility of significance levels with respect to microrray data.
All these different comparative assessments have been performed against two of the most widely currently used methods for DE analysis of RNAseq data, edgeR and DESeq, under four different parameter configurations each, since their default parametrisation is tailored towards very limited sample size. Making an informed decision on what is the most appropriate setup is not trivial for the nonexpert user and, for this reason, it is important to underscore that tweeDEseq is competitive with all of the methodologies that follow from the different configurations of edgeR and DESeq without the need to set a single parameter.
The fact that the volcano plots from tweeDEseq and limma, derived from RNAseq and microarray data, reveal that limma is able to find a larger number of DE genes from the goldstandard, suggests a long way still ahead of us to fully exploit the RNAseq technology for DE. Not only regarding experimental aspects, but also statistical ones such as properly detecting and adjusting for unwanted sources of nonbiological variability, for which there is currently no wellestablished available techniques, as in the case of microarray data.
Other applications of highthroughput sequencing technology that output counts of molecules, like in Copy Number Variation (CNV) analysis, could potentially benefit of models based on the PTdistribution. It is our perception that richer count data models of this kind will become increasingly necessary to draw accurate conclusions from data as technology brings us closer the actual biology of the cell.
Methods
Preprocessing of RNAseq data
We have analyzed data from Pickrell et al. (2010) [12] that sequenced RNA from LCLs in 69 Nigerian (YRI) [12] individuals. Raw reads were downloaded from http://eqtl.uchicago.edu/RNA_Seq_data/unmapped_reads webcite and preprocessed using the GRAPE pipeline [27]. This pipeline consists of first mapping the reads to the human genome version hg19 using the GEM mapper software [28]. Second, mapped reads were summarized into genelevel counts according to the GENCODE annotation version 3c [29] with Ensembl release 63 gene identifiers, by selecting those reads that mapped either completely within an exon or spanning a junction. This resulted in an initial table of counts of 38,415 Ensembl genes. This table of counts form part of the experimental data package tweeDEseqCountData available at http://www.bioconductor.org webcite under the name pickrell1.
The table of counts was filtered to discard lowly expressed genes by keeping only those with an average of more than 0.1 counts per million (CPM) throughout the samples. The results shown in Additional file 2: Figure S11 were obtained by applying a more stringent minimum cutoff of 0.5 CPM. When we applied a normalization method that adjusted for gene length and G+C content (see below), genes without this information were also discarded. When the minimum CPM was 0.1, then 31,226 genes were kept when no normalization method or edgeRTMM was applied and when cqn was applied then 27,438 were kept (see pg. 5 and 6 from Additional file 1). When the minimum CPM was 0.5 then these numbers decreased to 19,166 and 18,009 genes, respectively.
Three approaches to normalizing the table of counts from the LCL data have been considered. The first one is to work with the initial table of raw counts without any kind of normalization, the second one is to apply TMM [2] normalization as implemented in the edgeR[30] package, the third one is to use the methodology implemented in the cqn[4] Bioconductor package which adjusts for samplespecific effects of gene length and G+C content of every gene. When using the DESeq method for DE analysis in the LCL samples, the TMM normalization procedure was replaced by its own normalization procedure.
Raw counts were transformed into filtered and normalized counts for the purpose of producing MAplots (Figure 2), assessing goodness of fit to the NB distribution (Figure 3), examining the relationship between mean expression level and the shape parameter of the PT distribution (Figure 4) and doing DE analysis with tweeDEseq. In the case of DESeq raw counts were transformed into normalized counts only when used with the cqn normalization method.
In the case of edgeRTMM normalization, counts were transformed following the steps that the function exactTest() in edgeR takes: calculate normalization factors with the TMM method (calcNormFactors()), estimate effective library sizes and adjust counts to effective library sizes obtaining noninteger normalized pseudocounts (equalizeLibSizes()) which were subtracted by 0.5 and then raised to the smallest integers not less than these pseudocounts (ceiling()). These steps are written together in the function normalizeCounts() from the tweeDEseq package.
In the case of cqn, normalization offsets are calculated by the function cqn() as log2 RPMs, which are added to original raw log2 RPMs. These are rolled back to absolute numbers and “unlogged” obtaining noninteger normalized pseudocounts which, analogously to the edgeRTMM case, were subtracted by 0.5 and then raised to the smallest integers not less than these pseudocounts (ceiling()). The rationale behind subtracting 0.5 to the pseudocounts instead of directly truncating or raising to the next integer value, is to try to approach as much as possible the correct proportion of zero counts in the normalized data.
However, when performing DE analysis with edgeR, or with DESeq and its own normalization procedure, the specific recommendations made by the corresponding software authors were followed. More concretely, raw counts were not transformed in order to preserve their sampling properties and normalization adjustments entered the DE analysis through the corresponding normalization factors and offsets arguments within the functions that test for DE (see scripts for details in Additional file 1).
Preprocessing of microarray data
The microarray LCL data from [24] was processed from the raw CEL files available at http://www.ncbi.nlm.nih.gov/geo webciteunder accession GSE7792. Firstly, we only considered YRI samples. Secondly, data was processed using the Bioconductor oligo package. Quality assessment was performed by calculating NUSE and RLE diagnostics (Bolstad et al., 2005) and discarding those samples that either of the two reported diagnostics was considered below a minimum quality threshold. Third, using the RMA algorithm (Irizarry et al., 2003) implemented in the rma() function from the oligo package with argument target=~core~, expression values were background corrected, normalized and summarized into Affymetrix transcript clusters. Fourth, most samples formed part of family trios and only samples belonging to father or mother were kept. Fifth, using the getNetAffx() function from the oligo package, Ensembl Transcript identifiers well obtained for each Affymetrix transcript cluster identifier. Sixth, using the bioconductor package biomaRt, Ensembl Transcript identifiers were translated into Ensembl Gene identifiers, resolving multiple assignments by keeping the Ensembl Gene identifier that had a match in the Ensembl Gene identifiers forming the table of counts of the [12] RNAseq data, or choosing one arbitrarily, otherwise. Seven, duplicated assignments of the same Ensembl Gene identifier to multiple Affymetrix transcript cluster identifiers were resolved by keeping the transcript cluster with largest expression variability measured by its interquartile range (IQR).
At this point an expression data matrix of 16,323 Ensembl Genes by 74 samples was obtained and using the scanning date of each CEL file, samples were grouped into 5 batches, out of which one containing only three male samples was discarded leaving a total of 71 samples distributed into 4 balanced batches across gender. Batch effect was removed by using the QRdecomposition method implemented in the removeBatchEffect() function from the Bioconductor package limma[25] while keeping the sexspecific expression effect by setting the gender sample indicator variable within the design matrix argument. Finally, samples and genes were further filtered to match those from the RNAseq table of counts.
Matching RNAseq and microarray expression data matrices
To perform the analyses summarized in Figure 11 and Additional file 2: Figure S12 we further filtered the previously preprocessed RNAseq and microarray gene expression matrices to match both Ensembl Gene identifiers and individual HapMap identifiers. This resulted in two gene expression data matrices of equal dimension with 15,194 genes and 36 samples. We only considered the RNAseq data normalized with the cqn package.
To perform the analyses summarized in Figure 12 we built two other gene expression data matrices where, as before, samples were restricted to those 36 that matched between RNAseq and microarray data but genes were not, leading to a RNAseq and microarray gene expression data matrices of 27,438 and 16,323 Ensembl Genes by 36 samples, respectively. Genes were not matched since the purpose of these analyses was to gather insight into the differences and challenges in detecting DE genes using RNAseq with respect to microarray gene expression data with many replicates.
Functional annotations
Functional annotations for Ensembl genes forming the tables of counts, were retrieved from http://jun2011.archive.ensembl.org webcite with R and the biomaRt Bioconductor package. Gene length and G+C content annotations, used with the cqn normalization method, were obtained by downloading all human cDNAs from http://ftp.ensembl.org/pub/release63/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.63.cdna.all.fa.gz webcite and calculating the length and G+C content of the longest cDNA for each Ensembl gene.
The goldstandard list of genes with sexspecific expression was built with genes reported in the literature that, in one hand, escape chromosome X inactivation [22] and, on the other hand, belong to the malespecific region of chromosome Y [23]. In both cases, gene symbols were first translated into Ensembl gene identifiers and then further filtered to keep only those included in the set of Ensembl gene identifiers release 63. This resulted in a goldstandard list of 95 genes with sexspecific expression.
The list of housekeeping genes was retrieved from the literature [19] and mapped to Ensembl genes release 63, resulting in a final set of 669 housekeeping genes. The expression breadth reported in Figure 5 was obtained through the Barcode Gene Expression catalog [18] which uses information from 18,656 publicly available microarray samples from 131 tissue types, of the HGU133 Plus 2.0 Affymetrix chip, to estimate the proportion of tissue types in which a given probeset is expressed in more than half the samples. After discarding unreliable probes (annotated with highentropy in the catalog), we use these values as surrogates for expression breadth by mapping Affymetrix probeset identifiers to the genes in our table of counts through the hgu133plus2.db Bioconductor annotation package, leading to 16,292 genes with expression breadth values. When two or more probesets mapped to the same gene, the maximum value was taken for that gene.
All these functional data are included in the experimental data package tweeDEseqCountData available at http://www.bioconductor.org webcite under the keywords annotEnsembl63, genderGenes and hkGenes.
PoissonTweedie distributions
PoissonTweedie (PT) distributions have been studied by several authors [3134] and unify several overdispersed count data distributions (see Figure one in [34]). This family of distributions can be defined by a probability generating function and mass probabilities have to be computed using a recursive algorithm [31,34]. ElShaarawi et al. (2011) [34] compared different recursions and parameterizations of this family providing an algorithm to compute the PT probability distribution function. In the R package tweeDEseq we have developed a fast implementation, written in the C programming language, of this recursive algorithm. We briefly describe here the PT family of distributions as well as how we have used it to analyze RNAseq count data in the context of a differential expression (DE) analysis.
Following ElShaarawi et al. (2011) [34], let Y∼PT(a,b,c) be a PT random variable with vector of parameters θ=(a,b,c)^{T} defined in the domain
The PT random variable Y has a probability generating function (pgf) of the form:
when a≠0, while when a=0, then:
Using this parameterization, the following recursive algorithm can be used to compute the PT probability distribution function [34]:
where
and p_{i} denotes the probability of observing i counts.
For the sake of interpretability, we reparameterize θ=(a,b,c) to θ=(μ,ϕ,a), where μ denotes the mean, ϕ=σ^{2}/μ is the dispersion index (σ^{2} is the variance), and a the shape parameter that is used to define some count data distributions that are particular cases of PT such as Poisson or negative binomial. The relationship between both parameterizations is the following:
The PT model includes not only Poisson (a=1) and negative binomial (NB) (a=0) but also other distributions that have been used to analyze count data such as PoissonInverse Gaussian (PIG) (), PólyaAeppli (PA) (a=−1) or Neyman type A (a→−∞). Therefore, the PT distribution family unifies several diverse count data distributions, including different overdispersed distributions such as NB or PIG. These distributions can model different scenarios as, for instance, a RNAseq expression profile with a wide dynamic range leading to a heavy tail in the distribution. In such a case, PIG has a heavier tail than NB and this would make it more appropriate for such a gene. Note that an extremely heavy tail implies overdispersion, but the converse does not hold; hence the NB distribution is not adequate to model RNAseq expression profiles of genes with a wide dynamic range due to their intrinsic biological variability [15].
Given a certain parameterization Kokonendji et al. (2004) [17] prove that the meanvariance relationship for the PT family can be expressed as:
where p is the shape parameter of that specific parameterization. It follows that, whereas the NB distribution is only able to capture a quadratic meanvariance relationship, the PT family is able to generalize this relationship to any order. As a result, it is more convenient to use the PT model when dealing with count data which presents variable overdispersion.
Parameter estimation for PoissonTweedie distributions
We need to estimate the parameter vector to develop, on the one hand, a test of goodnessoffit to an NB distribution and, on the other hand, a twosample PTtest for differences in means. This latter test is used for detecting differentially expressed genes. Without loss of generality, let y_{gk} be the number of counts for gene g in sample k, derived from preprocessing RNAseq data. We assume that y_{gk} follows the PT distribution:
In practice, we do not know the parameters θ_{g}=μ_{g},ϕ_{g},a_{g}, but we can estimate them from data by maximum likelihood when the sample size is sufficiently large so that it guarantees the desirable large sample properties of unbiasedness and minimum variance of the maximum likelihood estimate (MLE). In the Additional file 2: Supplementary Information we provide a simulation study in order to estimate the minimum number of samples per group that approximately meets this requirement (see Additional file 2: Figure S7).
We obtained the MLE using a quasiNewton method with constraints. We have implemented such a procedure using the optim function in R. In order to guarantee good convergence, we consider as initial parameters the moment estimates of μ_{g} and ϕ_{g}, and a_{g}=0. We choose this value for a_{g} because it corresponds to an NB model that is the natural cutpoint of PT’s parameter space.
Goodnessoffit to a negative binomial distribution
In the framework of PT distributions we can formulate a test of the goodness of fit to an NB distribution by considering H_{0}:a=0 versus H_{a}:a≠0. Using a likelihood ratio test (LRT), the testing statistic is [34]
where numerator and denominator correspond to the likelihood functions for the PT and NB models, respectively. Since the PT model has just one parameter more than the NB model, the quantity under the null hypothesis, as n grows large, and it can be used to decide whether count data follow a NB distribution by means of a QQ plot (see Additional file 2: Figure S2) or by calculating the corresponding Pvalue.
Test to determine differentially expressed genes
For a given gene, let us assume that we observe c_{1},c_{2},…,c_{n} counts for n individuals and that we tabulate these counts into a contingency table with cells, y_{0},y_{1},…,y_{m} where m=max{c_{1},…,c_{n}}. Therefore, y_{c} represents the number of observations with c counts. Then, the loglikelihood can be written as follows
where and denotes the mass probability at i with i=0,1,…,m and is computed using the recurrence given in equation (6). ElShaarawi et al. (2011) [34] indicate that when regularity conditions hold, that is, when θ is an interior point of the parameter space θ, asymptotic normality of can be assumed. Therefore, the negative inverse Hessian matrix of the loglikelihood at the MLE corresponds to the estimated covariance matrix of . In particular, for the μ parameter we have that
Consequently, if we are interested in comparing the mean counts for two sample groups, denoted by μ_{A} and μ_{B}, a twosample PTtest for the mean with null hypothesis , which we perform in logarithmic scale as H_{0}: log(μ_{A})= log(μ_{B}), can be built by calculating the PTstatistic:
The PTstatitic, T, follows a standard normal distribution under the null hypothesis. Therefore, the (1−α)% percentile of a N(0,1) distribution is used to determine whether the observed differences between the two groups are statistically significant or not by providing a corresponding Pvalue that can be later on corrected for multiple testing using, for instance, BenjaminiHochberg’s FDR [35].
Simulation of RNAseq data
The results shown in Figure 6 recreating the null hypothesis of no DE with real RNAseq data were performed by dividing the LCL data into two separate data sets of male and female samples. From each data set we bootstrapped 100 times two groups of 20 samples uniformly at random, thus obtaining on the one hand, group pairs of female samples and, on the other hand, group pairs of male samples. On each bootstrapped data set we performed the twosample test for DE detection of every method between the groups of female versus female samples and male versus male samples. We also considered two versions of the data, one with the raw unnormalized counts and the other with the counts normalized with the cqn package [4]. In principle, there are no DE genes to be discovered from these comparisons, and therefore, under the null hypothesis of no DE, the Pvalue distribution for any given gene throughout the 100 bootstrapped data sets should be uniform.
The simulations shown in Figures 7, 8 and 9 contained synthetic RNAseq data generated from a gammaPoisson mixture model in a similar way to other published studies [8]. Under this model, we first draw dispersion parameters ϕ_{g} for every gene g at random from a gamma distribution Gamma(k=2,θ=0.7) and means according to three different foldchanges (1.5, 2 and 4) where half of the genes were upregulated and the other half downregulated. The λ_{gi} Poisson parameter for every gene g and sample i was drawn at random from a gamma distribution Gamma(k=a,θ=1/(ϕ−1)) with a=fμ_{gk}/(ϕ−1) and f≈N(0,σ) corresponding to library factor which was either constant (σ=0) or variable (σ=0.5). Counts were simulated for each gene g from the resulting mixture gammaPoisson distribution with parameters λ_{gi} for each sample i. Note that the resulting marginal distribution from the gammaPoisson is a negativebinomial.
Software availability
•Project name:tweeDEseq
•Project home page: http://www.bioconductor.org/packages/release/bioc/html/tweeDEseq.html webcite
•Operating system(s): Platform independent
•Programming language:R and C
•Other requirements:R 3.0.0
•Licence: GNU GPL
•Any restrictions tu use by nonacademics: no restrictions
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JRG and PP conceived the idea of modelling RNAseq count data using PT family of distributions. ME programmed the recursive algorithm to compute PT probability distribution, performed simulation studies, and created the R package jointly with JRG and RC. PP and JRG proposed the statistical test for detecting DE genes. DG preprocessed the RNAseq data. RC, ME and JRG analysed the data and wrote the paper. The project was supervised by JRG. All authors read and approved the final manuscript.
Acknowledgements
This work was supported by grants from the ‘Ministerio de Ciencia e Innovación  MICINN’ (MTM201126515 to JRG and ME, MTM201009526E to JRG, MTM200910893 to PP and TIN201122826 to RC), from European Reseach Council, Breathe project, ERCAdG, GA num. 268479 to ME.
References

Mortazavi1 A, Williams B, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq.
Nat Methods 2008, 5:621628. PubMed Abstract  Publisher Full Text

Robinson M, Oshlack A: A scaling normalization method for differential expression analysis of RNAseq data.
Genome Biol 2010, 11:R25. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Risso D, Schwartz K, Sherlock G, Dudoit S: GCcontent normalization for RNASeq data.
BMC Bioinformatics 2011, 12:480. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Hansen KD, Irizarry RA, Wu Z: Removing technical variability in RNAseq data using conditional quantile normalization.
Biostatistics 2012, 13(2):204216. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Marioni J, Mason C, Mane S, Stephens M, Gilad Y: RNAseq: An assessment of technical reproducibility and comparison with gene expression arrays.
Genome Res 2008, 18:15091517. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Robinson MD, Smyth GK: Smallsample estimation of negative binomial dispersion, with applications to SAGE data.
Biostatistics 2008, 9(2):321332. PubMed Abstract  Publisher Full Text

Anders S, Huber W: Differential expression analysis for sequence count data.
Genome Biol 2010, 11(10):R106. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Lund SP, Nettleton D, McCarthy DJ, Smyth GK: Detecting differential expression in RNAsequence data using quasiLikelihood with shrunken dispersion estimates.
Stat Appl Genet Mol Biol 2012., 11(5)
doi:10.1093/biostatistics/kxs033.

Hardcastle TJ, Kelly KA: baySeq: empirical Bayesian methods for identifying differential expression in sequence count data.
BMC Bioinformatics 2010, 11:422. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

McCarthy DJ, Chen Y, Smyth GK: Differential expression analysis of multifactor RNASeq experiments with respect to biological variation.
Nucleic Acids Res 2012, 40(10):42884297. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wu H, Wang C, Wu Z: A new shrinkage estimator for dispersion improves differential expression detection in RNAseq data.
Biostatistics 2012.
doi:10.1093/biostatistics/kxs033.

Pickrell J, Marioni J, Pai A, Degner J, Engelhardt B, Nkadori E, Veyrieras J, Stephens M, Gilad Y, Pritchard J: Understanding mechanisms underlying human gene expression variation with RNA sequencing.
Nature 2010, 464:768772. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: open software development for computational biology and bioinformatics.
Genome Biol 2004, 5(10):R80. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Van De Wiel MA, Leday GGR, Pardo L, Rue H, Van Der Vaart AW, Van Wieringen WN: Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors.
Biostatistics 2012.
doi:10.1093/biostatistics/kxs031.

Hansen K, Wu Z, Irizarry R, Leek J: Sequencing technology does not eliminate biological variability.
Nat Biotech 2011, 29:572573. Publisher Full Text

Jorgensen B: The Theory of Dispersion Models. New York: Chapman and Hall; 1997.

Kokonendji C, DossouGbété S, Demétrio C: Some discrete exponencial dispersion models: PoissonTweedie and HindeDemétrio classes.

McCall M, Uppal K, Jaffee H, Zilliox R M J Irizarry: The Gene Expression Barcode: leveraging public data repositories to begin cataloging the human and murine transcriptomes.
Nucleic Acids Res 2011, 39:D1011D1015. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Eisenberg E, Levanon EY: Human housekeeping genes are compact.
Trends Genet 2003, 19(7):362365. PubMed Abstract  Publisher Full Text

Leek JT, Storey JD: Capturing heterogeneity in gene expression studies by surrogate variable analysis.
PLoS Genet 2007, 3(9):17241735. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Storey JD, Tibshirani R: Statistical significance for genomewide studies.
Proc Natl Acad Sci U S A 2003, 100(16):94409445. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Carrel L, HF W: Xinactivation profile reveals extensive variability in Xlinked gene expression in females.
Nature 2005, 434:400404. PubMed Abstract  Publisher Full Text

Skaletsky H, KurodaKawaguchi T, Minx P, Cordum H, Hillier L, Brown L, Repping S, Pyntikova T, Ali J, Bieri T, Chinwalla A, Delehaunty A, Delehaunty K, Du H, Fewell G, Fulton L, Fulton R, Graves T, Hou SF, Latrielle P, Leonard S, Mardis E, Maupin R, McPherson J, Miner T, Nash W, Nguyen C, Ozersky P, Pepin K, Rock S, Rohlfing T, Scott K, Schultz B, Strong C, TinWollam A, Yang SP, Waterston R, Wilson R, Rozen S, Page D: The malespecific region of the human Y chromosome is a mosic of discrete sequence classes.
Nature 2003, 423:825837. PubMed Abstract  Publisher Full Text

Huang RS, Duan S, Bleibel WK, Kistner EO, Zhang W, Clark TA, Chen TX, Schweitzer AC, Blume JE, Cox NJ, Dolan ME: A genomewide approach to identify genetic variants that contribute to etoposideinduced cytotoxicity.
Proc Natl Acad Sci U S A 2007, 104(23):97589563. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Smyth GK: Linear models and empirical bayes methods for assessing differential expression in microarray experiments.
Stat Appl Genet Mol Biol 2004., 3
doi:10.2202/15446115.1027.

Nguyen DK, Disteche CM: Dosage compensation of the active X chromosome in mammals.
Nat Genet 2006, 38:4753. PubMed Abstract  Publisher Full Text

Knowles DG, Röder M, Merkel A, Guigó R: Grape RNASeq analysis pipeline environment.
Bioinformatics 2013, 29(5):614621. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

MarcoSola S, Sammeth M, Guigó R, Ribeca P: The GEM mapper: fast, accurate and versatile alignment by filtration.
Nat Methods 2012, 9(12):11851188. PubMed Abstract  Publisher Full Text

Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JGR, Storey R, Swarbreck D, Rossier C, Ucla C, Hubbard T, Antonarakis SE, Guigo R:

Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.
Bioinformatics 2010, 26:139140. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hougaard P, Lee ML, Whitmore G: Analysis of overdispersed count data by mixtures of Poisson variables and Poisson processes.
Biometrics 1997, 53:12251238. PubMed Abstract  Publisher Full Text

Gupta R, Ong S: A new generalization of the negative binomial distribution.
Compu Stat Data An 2004, 45:287300. Publisher Full Text

Puig P, Valero J: Count Data Distributions: Some Characterizations With Applications.
J Am Stat Assoc 2006, 101:332340. Publisher Full Text

ElShaarawi A, Zhu R, Joe H: Modelling species abundance using the PoissonTweedie family.
Environmetrics 2011, 22:152164. Publisher Full Text

Benjamini Y, Hochberg Y: Controlling the false discovery rate: A practical and powerful approach to multiple testing.