Bioinformatics Core Facility, SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland

Département de formation et recherche, Centre Hospitalier Universitaire Vaudois and University of Lausanne, Lausanne, Switzerland

Abstract

Background

Finding genes that are differentially expressed between conditions is an integral part of understanding the molecular basis of phenotypic variation. In the past decades, DNA microarrays have been used extensively to quantify the abundance of mRNA corresponding to different genes, and more recently high-throughput sequencing of cDNA (RNA-seq) has emerged as a powerful competitor. As the cost of sequencing decreases, it is conceivable that the use of RNA-seq for differential expression analysis will increase rapidly. To exploit the possibilities and address the challenges posed by this relatively new type of data, a number of software packages have been developed especially for differential expression analysis of RNA-seq data.

Results

We conducted an extensive comparison of eleven methods for differential expression analysis of RNA-seq data. All methods are freely available within the R framework and take as input a matrix of counts, i.e. the number of reads mapping to each genomic feature of interest in each of a number of samples. We evaluate the methods based on both simulated data and real RNA-seq data.

Conclusions

Very small sample sizes, which are still common in RNA-seq experiments, impose problems for all evaluated methods and any results obtained under such conditions should be interpreted with caution. For larger sample sizes, the methods combining a variance-stabilizing transformation with the ‘limma’ method for differential expression analysis perform well under many different conditions, as does the nonparametric SAMseq method.

Background

Transcriptome analysis is an important tool for characterization and understanding of the molecular basis of phenotypic variation in biology, including diseases. During the past decades microarrays have been the most important and widely used approach for such analyses, but recently high-throughput sequencing of cDNA (RNA-seq) has emerged as a powerful alternative.

Arguably the most common use of transcriptome profiling is in the search for differentially expressed (DE) genes, that is, genes that show differences in expression level between conditions or in other ways are associated with given predictors or responses. RNA-seq offers several advantages over microarrays for differential expression analysis, such as an increased dynamic range, a lower background level, and the ability to detect and quantify the expression of previously unknown transcripts and isoforms.

Other types of non-uniformities are seen

Microarrays have been used routinely for differential expression analysis for over a decade, and there are well-established methods available for this purpose (such as limma).

The field of differential expression analysis of RNA-seq data is still in its infancy and new methods are continuously being presented. So far, there is no general consensus regarding which method performs best in a given situation and few extensive comparisons between the proposed methods have been published. In a recent paper

In the present paper we conduct a comparison of eleven methods, developed for differential expression analysis of RNA-seq data, under different experimental conditions. Among the eleven methods, nine model the count data directly, while the remaining two transform the counts before applying a traditional method for differential expression analysis of microarray data. The study is confined to methods that are implemented and available within the R framework.

Results and discussion

Eleven methods for differential expression analysis of RNA-seq data were evaluated in this study. Nine of them work on the count data directly: DESeq, edgeR, NBPSeq, TSPM, baySeq, EBSeq, NOISeq, SAMseq and ShrinkSeq, while the remaining two (voom+limma and vst+limma) transform the counts before applying the limma framework.

The methods were evaluated mainly based on synthetic data, where we could control the settings and the true differential expression status of each gene. Details regarding the different simulation studies can be found in the Materials and Methods section. As the baseline (simulation studies abbreviated ‘

The total number of genes in each simulated data set was 12,500, and the number of differentially expressed (DE) genes was set to either 0, 1,250 or 4,000. We also varied the composition of the DE genes, that is, the fraction of DE genes that were up- and downregulated, respectively, in one condition compared to the other. Finally, we evaluated the effect of varying the sample size, from 2 to 5 or 10 samples per condition. These sample sizes were chosen to reflect a wide range of experimental settings. However, since most current RNA-seq experiments have small sample sizes and the design choice often stands between two and three samples per condition, we also performed some comparisons with 3 samples per condition. These comparisons, contrasted with the results from 2 and 5 samples per condition, are given in the supplementary material (Additional file

**Contains supplementary figures referred to in the text.** Here, we also evaluate the effect of selecting different values for the parameters of edgeR and DESeq, evaluate two additional transformation-based methods, and evaluate the effect of simulating data with different dispersion parameters in the two compared conditions. We also present some comparisons based on data sets with 3 samples per condition. The file also contains information regarding the estimation of the mean and dispersion parameters from real data, and an additional analysis of two real RNA-seq data sets. Finally, it contains sample R code to run the differential expression analyses and estimates of the computational time requirements for the different methods.


In addition to the simulated data, we compared the methods based on their performance for three real RNA-seq data sets. The results from one of these data sets are shown in the main article, and the remaining two are discussed in the supplementary material (Additional file

Using the synthetic data, we studied the following aspects of the methods under different experimental conditions:

•The ability to rank truly DE genes ahead of non-DE genes. This was evaluated in terms of the area under a Receiver Operating Characteristic (ROC) curve (AUC), as well as in terms of false discovery curves, depicting the number of false detections encountered while going through the list of genes ranked according to the evidence for differential expression.

•The ability to control type I error rate and false discovery rate at an imposed level. This was evaluated by computing the observed type I error and the true false discovery rate, respectively, among the genes called differentially expressed at given significance levels.

•The computational time requirement for running the differential expression analysis. These results are given in the supplementary material (Additional file

For the real RNA-seq data we compared the collections of genes called DE by the different methods, both in terms of their individual cardinalities and in terms of their overlaps. We also studied the concordance of the gene rankings obtained by the different methods.

Discrimination between DE and non-DE genes

We first evaluated to what extent the eleven considered methods were able to discriminate between truly DE genes and truly non-DE ones. We computed a score for each gene and each method, which allowed us to rank the genes in order of significance or evidence for differential expression between the two conditions. For the six methods providing nominal p-values (edgeR, DESeq, NBPSeq, TSPM, voom+limma, vst+limma), we defined the score as 1 − p_nom, where p_nom denotes the nominal p-value. For SAMseq we used the absolute value of the averaged Wilcoxon statistic as the ranking score, and for baySeq, EBSeq and ShrinkSeq we used the estimated posterior probability of differential expression or, equivalently in terms of ranking, 1 − BFDR, where BFDR denotes the estimated Bayesian False Discovery Rate. For NOISeq, we used the statistic q_NOISeq as the ranking score (see Materials and Methods). All these scores are two-sided, that is, they are not affected by the direction of differential expression between the two conditions. Given a threshold value for such a score, we may thus choose to call all genes with scores exceeding the threshold DE, and correspondingly all genes with scores below the threshold are called non-DE. Considering the genes that were simulated to be DE as the true positive group and the remaining genes as the true negative group, we computed the false positive rate and the true positive rate for all possible score thresholds and constructed a ROC (Receiver Operating Characteristic) curve for each method. The area under the ROC curve (AUC) was used as a measure of the overall discriminative performance of a method, that is, the overall ability to rank truly DE genes ahead of truly non-DE ones.
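As an illustration of this construction, the AUC can be computed directly from the per-gene scores via the rank-sum (Mann-Whitney) identity, without explicitly sweeping score thresholds. The sketch below is ours, not the study's R code; it assumes scores where larger values indicate stronger evidence for differential expression:

```python
import numpy as np

def auc_score(scores, is_de):
    """Area under the ROC curve via the rank-sum identity:
    AUC = P(score of a random DE gene > score of a random non-DE gene),
    counting ties as 1/2."""
    scores = np.asarray(scores, dtype=float)
    is_de = np.asarray(is_de, dtype=bool)
    pos, neg = scores[is_de], scores[~is_de]
    # All pairwise comparisons; O(n_pos * n_neg), fine for illustration.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

Perfect ranking of DE genes ahead of non-DE genes gives an AUC of 1, while random ranking gives 0.5 in expectation.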

Under baseline conditions, and when only 10% of the genes were simulated to be DE, the discriminative performance was worse when all DE genes were upregulated in condition S_2 compared to S_1 than when some genes were upregulated and some were downregulated (compare panels A and B of the AUC figure). The variability between simulation instances was also larger when all DE genes were upregulated in S_2 compared to condition S_1.

Area under the ROC curve (AUC)

**Area under the ROC curve (AUC).** Area under the ROC curve (AUC) for the eleven evaluated methods, in six simulation settings (panels **A**–**F**). The boxplots summarize the AUCs obtained across 10 independently simulated instances of each simulation study. Each panel shows the AUCs across three sample sizes (|S_{1}| = |S_{2}| = 2, 5 and 10, respectively, signified by the last number in the tick labels). The methods are ordered according to their median AUC for the largest sample size. When all DE genes were regulated in the same direction, increasing the number of DE genes from 1,250 (panel **A**) to 4,000 (panel **C**) impaired the performance of all methods. In contrast, when the DE genes were regulated in different directions (panels **B** and **D**), the number of DE genes had much less impact. The variability of the performance of baySeq was much higher when all genes were regulated in the same direction (panels **A** and **C**) compared to when the DE genes were regulated in different directions (panels **B** and **D**). Including outliers (panels **E** and **F**) decreased the AUC for most methods (compare to panel **B**), but less so for the transformation-based methods (voom+limma and vst+limma) and SAMseq.

For the largest sample sizes (5 or 10 samples per condition) and when there were both up- and downregulated genes, all methods performed similarly in terms of the AUC. All methods performed better for large sample sizes. TSPM and EBSeq showed the strongest sample size dependencies among the methods, followed by SAMseq and baySeq. For the smallest sample size (2 samples per condition), the best results were generally obtained by DESeq, edgeR, NBPSeq, voom+limma and vst+limma.

When all DE genes were upregulated in condition S_{2} compared to condition S_{1} (Figures

We chose to evaluate the effect of introducing non-overdispersed genes or outliers under the settings of simulation study

While the AUC provides an overall measure of the ability to rank truly DE genes ahead of truly non-DE genes, it does not immediately tell us if the deviation from a perfect discrimination is mainly due to false positives or false negatives. We therefore also constructed false discovery curves, depicting the number of false discoveries encountered as the total number of discoveries increased (that is, as the significance threshold for the ranking score was changed). Figure
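A false discovery curve of the kind described above can be obtained from a ranked gene list with a simple cumulative count. The sketch below is illustrative only (the function name is ours); it assumes genes are ranked by decreasing score:

```python
import numpy as np

def false_discovery_curve(scores, is_de, max_top=1500):
    """Number of false discoveries among the T top-ranked genes,
    for T = 1, ..., max_top (genes ranked by decreasing score)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    # A ranked gene is a false discovery if it is truly non-DE.
    false_hits = ~np.asarray(is_de, dtype=bool)[order]
    return np.cumsum(false_hits)[:max_top]
```

Plotting the returned vector against T reproduces one curve of the figure below the threshold T = 1,500 used in the study.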

False discovery curves

**False discovery curves.** Representative false discovery curves, depicting the number of false positives encountered among the T top-ranked genes by the eleven evaluated methods, for T between 0 and 1,500. In all cases, there were 5 samples per condition. Panels **A**–**F** correspond to six different simulation studies.

Larger sample sizes led to considerably fewer false positives found among the top-ranked genes (compare Figure

Control of type I error rate

Next, we evaluated the six methods returning nominal p-values (edgeR, DESeq, NBPSeq, TSPM, voom+limma and vst+limma) in terms of their ability to control the type I error at a pre-specified level in the absence of any truly DE genes. Under baseline conditions (simulation study

Type I error rates

**Type I error rates.** Type I error rates, for the six methods providing nominal p-values, in four simulation settings (panels **A**–**D**). Letting some counts follow a Poisson distribution (panel **B**) reduced the type I error rates for TSPM slightly but had overall a small effect. Including outliers with abnormally high counts (panels **C** and **D**) had a detrimental effect on the ability to control the type I error for edgeR and NBPSeq, while DESeq became slightly more conservative.

The results stayed largely similar when we let the counts for half of the genes be Poisson distributed (simulation study

Control of the false discovery rate

Next, we examined whether setting a significance threshold for the adjusted p-value (or an FDR threshold) indeed controlled the false discovery rate at the desired level. We set the FDR threshold to 0.05, and calculated the true false discovery rate as the fraction of the genes called significant at this level that were indeed false discoveries. Since NOISeq does not return a statistic that is recommended for use as an adjusted p-value or FDR estimate, it was excluded from this evaluation. For baySeq, EBSeq and ShrinkSeq, we imposed the desired threshold on the Bayesian FDR.
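The observed (true) false discovery rate used in this evaluation can be sketched as follows. This is a hypothetical illustration, not the study's code; it returns None when no gene is called, the case treated as undefined in the results:

```python
import numpy as np

def observed_fdr(adj_pvals, is_de, alpha=0.05):
    """True (observed) false discovery rate: the fraction of genes called
    DE at adjusted p-value (or estimated FDR) <= alpha that are in fact
    non-DE. Returns None when no gene is called."""
    called = np.asarray(adj_pvals) <= alpha
    if called.sum() == 0:
        return None
    false_calls = (called & ~np.asarray(is_de, dtype=bool)).sum()
    return float(false_calls / called.sum())
```

An FDR-controlling method should keep this quantity at or below alpha on average over repeated simulations.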

As above, when only 10% of the genes were DE, the direction of their regulation had little effect on the false discovery rate (simulation studies

True false discovery rates

**True false discovery rates.** True false discovery rates (FDR) observed for an imposed FDR threshold of 0.05, for the nine methods returning adjusted p-values or FDR estimates, in six simulation settings (panels **A**–**F**). With only two samples per condition, three of the methods (vst+limma, voom+limma and SAMseq) did not call any DE genes, and the FDR was considered to be undefined.

When the DE genes were regulated in different directions, increasing the number of DE genes from 1,250 to 4,000 improved the ability to control the FDR (simulation study

In a practical situation, we are interested not only in keeping the rate of false discoveries low, but also in actually finding the true positives. Therefore, we also computed the true positive rate (the fraction of truly DE genes that were found to be significant) among the genes that were called significant at an FDR threshold of 0.05. In general, DESeq and baySeq tended to give the lowest number of true positives (Additional file

As expected, increasing the expression difference between the two conditions (i.e., the effect size γ_g; see Materials and Methods) improved the ability to detect truly DE genes and reduced the observed false discovery rate, in a concordant manner for all methods (data not shown). When the dispersions in the two conditions were different, we observed an increased FDR for the majority of the methods (Additional file

Real RNA-seq data from two mouse strains

In addition to the synthetic data, we also analyzed an RNA-seq data set from 21 mice, 10 of the C57BL/6J strain and 11 of the DBA/2J strain.

First, we compared the number of DE genes found by each method (Figure

Analysis of the Bottomly data set

**Analysis of the Bottomly data set. A**: The number of genes found to be significantly DE between the two mouse strains in the Bottomly data set. **B-C**: Overlap among the sets of DE genes found by different methods. **D**: The average number of genes found to be significantly DE when contrasting two subsets of mice from the same strain, in which case we expect no truly DE genes.

The table contains the number of differentially expressed genes that are shared between each pair of methods, for the Bottomly data set (compare to Figure

| | ShrinkSeq | DESeq | edgeR | NBPSeq | TSPM | voom | vst | baySeq | EBSeq | SAMseq |
|---|---|---|---|---|---|---|---|---|---|---|
| **ShrinkSeq** | **3259** | 583 | 1125 | 985 | 1075 | 971 | 1049 | 192 | 803 | 1821 |
| **DESeq** | 583 | **598** | 598 | 567 | 588 | 589 | 587 | 191 | 523 | 592 |
| **edgeR** | 1125 | 598 | **1160** | 877 | 886 | 942 | 1013 | 194 | 753 | 1099 |
| **NBPSeq** | 985 | 567 | 877 | **1082** | 695 | 753 | 797 | 194 | 612 | 924 |
| **TSPM** | 1075 | 588 | 886 | 695 | **1161** | 891 | 907 | 191 | 794 | 1014 |
| **voom** | 971 | 589 | 942 | 753 | 891 | **1009** | 971 | 194 | 752 | 991 |
| **vst** | 1049 | 587 | 1013 | 797 | 907 | 971 | **1095** | 194 | 752 | 1061 |
| **baySeq** | 192 | 191 | 194 | 194 | 191 | 194 | 194 | **195** | 175 | 194 |
| **EBSeq** | 803 | 523 | 753 | 612 | 794 | 752 | 752 | 175 | **819** | 801 |
| **SAMseq** | 1821 | 592 | 1099 | 924 | 1014 | 991 | 1061 | 194 | 801 | **1860** |

In Additional file

In Additional file

To further evaluate the performance of the methods, we applied them to the data set consisting of only the mice from the C57BL/6J strain, within which we defined two arbitrary sample classes of 5 samples each. The analysis was repeated five times for different arbitrary divisions. Under these conditions, we expect that no genes are truly DE. Nevertheless, most methods found differentially expressed genes in at least one instance. TSPM found by far the largest number of DE genes (Figure

Conclusions

In this paper, we have evaluated and compared eleven methods for differential expression analysis of RNA-seq data. Table

The table summarizes the present study by means of the main observations and characteristic features for each of the evaluated methods. We have grouped voom+limma and vst+limma together since they performed overall very similarly.

DESeq

- Conservative with default settings. Becomes more conservative when outliers are introduced.

- Generally low TPR.

- Poor FDR control with 2 samples/condition, good FDR control for larger sample sizes, also with outliers.

- Medium computational time requirement, increases slightly with sample size.

edgeR

- Slightly liberal for small sample sizes with default settings. Becomes more liberal when outliers are introduced.

- Generally high TPR.

- Poor FDR control in many cases, worse with outliers.

- Medium computational time requirement, largely independent of sample size.

NBPSeq

- Liberal for all sample sizes. Becomes more liberal when outliers are introduced.

- Medium TPR.

- Poor FDR control, worse with outliers. Often truly non-DE genes are among those with smallest p-values.

- Medium computational time requirement, increases slightly with sample size.

TSPM

- Overall highly sample-size dependent performance.

- Liberal for small sample sizes, largely unaffected by outliers.

- Very poor FDR control for small sample sizes, improves rapidly with increasing sample size. Largely unaffected by outliers.

- When all genes are overdispersed, many truly non-DE genes are among the ones with smallest p-values. Remedied when the counts for some genes are Poisson distributed.

- Medium computational time requirement, largely independent of sample size.

voom / vst

- Good type I error control, becomes more conservative when outliers are introduced.

- Low power for small sample sizes. Medium TPR for larger sample sizes.

- Good FDR control except for simulation study

- Computationally fast.

baySeq

- Highly variable results when all DE genes are regulated in the same direction. Less variability when the DE genes are regulated in different directions.

- Low TPR. Largely unaffected by outliers.

- Poor FDR control with 2 samples/condition, good for larger sample sizes in the absence of outliers. Poor FDR control in the presence of outliers.

- Computationally slow, but allows parallelization.

EBSeq

- TPR relatively independent of sample size and presence of outliers.

- Poor FDR control in most situations, relatively unaffected by outliers.

- Medium computational time requirement, increases slightly with sample size.

NOISeq

- Not clear how to set the threshold for q_NOISeq to correspond to a given FDR threshold.

- Performs well, in terms of false discovery curves, when the dispersion is different between the conditions (see supplementary material).

- Computational time requirement highly dependent on sample size.

SAMseq

- Low power for small sample sizes. High TPR for large enough sample sizes.

- Performs well also for simulation study

- Largely unaffected by introduction of outliers.

- Computational time requirement highly dependent on sample size.

ShrinkSeq

- Often poor FDR control, but also allows the user to impose a fold-change threshold in the inference procedure.

- High TPR.

- Computationally slow, but allows parallelization.

Small sample sizes (2 samples per condition) also imposed problems for the methods that were indeed able to find differentially expressed genes, leading to false discovery rates that sometimes widely exceeded the imposed FDR threshold. For the parametric methods this may be due to inaccuracies in the estimation of the mean and dispersion parameters. In our study, TSPM stood out as the method most affected by the sample size, potentially due to its use of asymptotic statistics. Even though the trend is towards larger sample sizes, and barcoding and multiplexing create opportunities to analyze more samples at a fixed cost, as of today RNA-seq experiments are often too expensive to allow extensive replication. The results conveyed in this study strongly suggest that differentially expressed genes found between small collections of samples need to be interpreted with caution and that the true FDR may be several times higher than the selected FDR threshold.

DESeq, edgeR and NBPSeq are based on similar principles and showed, overall, relatively similar accuracy with respect to gene ranking. However, the sets of significantly differentially expressed genes at a pre-specified FDR threshold varied considerably between the methods, due to the different ways of estimating the dispersion parameters. With default settings and for reasonably large sample sizes, DESeq was often overly conservative, while edgeR and in particular NBPSeq were often too liberal and called a larger number of false (and true) DE genes. In the supplementary material (Additional file

EBSeq, baySeq and ShrinkSeq use a different inferential approach, and estimate the posterior probability of being differentially expressed, for each gene. baySeq performed well under some conditions but the results were highly variable, especially when all DE genes were upregulated in one condition compared to the other. In the presence of outliers, EBSeq found a lower fraction of false positives than baySeq for large sample sizes, while the opposite was true for small sample sizes.

Methods

In the following section we give a brief overview of the eleven methods for differential expression analysis that are evaluated and compared in the present paper. For more elaborate descriptions we refer to the original publications. All methods take their starting point in a count matrix, containing the number of reads mapping to each gene in each of the samples in the experiment. Nine of the methods work directly on the count data, while the remaining two transform the counts and feed the transformed values into the R package limma.

The methods working directly on the count data can be broadly divided into parametric methods (baySeq, DESeq, EBSeq, edgeR, NBPSeq, TSPM and ShrinkSeq) and non-parametric methods (NOISeq and SAMseq).

Most of the remaining parametric models (baySeq, DESeq, EBSeq, edgeR and NBPSeq) instead use a Negative Binomial (NB) model to account for the overdispersion, while ShrinkSeq allows the user to select among a number of different distributions, including the NB and a zero-inflated NB distribution. DESeq, edgeR and NBPSeq take a classical hypothesis testing approach, while baySeq, EBSeq and ShrinkSeq are cast within a Bayesian framework. It is acknowledged that a crucial part of the inference procedure is to obtain a reliable estimate of the dispersion parameter for each gene, and hence considerable effort is put into this estimation. Due to the small sample size of most RNA-seq experiments it is difficult to estimate the gene-wise dispersion parameters reliably, which motivates sharing information across all genes in the data set in order to obtain more accurate estimates. DESeq, edgeR and NBPSeq all incorporate information sharing in the dispersion estimation, and the way this sharing is done constitutes the main difference between the three methods. The first suggestion

The approach used by baySeq and EBSeq is similar to the three previously mentioned methods in terms of the underlying NB model, but differs in terms of the inference procedure. For baySeq, the user defines a collection of models, each specifying which samples share the same distributional parameters.

ShrinkSeq, which also takes a Bayesian perspective, supports a number of different count models, including the NB and a zero-inflated NB. It provides shrinkage of the dispersion parameter, but also of other parameters such as the regression coefficients that are of interest for the inference. Furthermore, it incorporates a step for refining the priors, and subsequently the posteriors, non-parametrically after fitting the model for each feature.

The two non-parametric methods evaluated here, NOISeq and SAMseq, do not assume any particular distribution for the data. SAMseq is based on a Wilcoxon statistic, averaged over several resamplings of the data, and uses a sample permutation strategy to estimate a false discovery rate for different cutoff values for this statistic. These estimates are then used to define a q-value for each gene. NOISeq explores the distribution of fold-changes and absolute expression differences between the two contrasted conditions for the observed data, and compares this distribution to the corresponding distribution obtained by comparing pairs of samples belonging to the same condition (this is called the “noise distribution”). Briefly, NOISeq computes, for each gene, a statistic (here denoted q_NOISeq) defined as the fraction of points from the noise distribution that correspond to a lower fold change and a lower absolute expression difference than those of the gene of interest in the original data.
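As a rough illustration of the NOISeq statistic described above, one can count the fraction of points from the noise distribution dominated by a gene's observed values. This is a simplified reading of ours, not the NOISeq implementation:

```python
import numpy as np

def noiseq_like_stat(m_g, d_g, noise_m, noise_d):
    """Fraction of (log fold change, expression difference) points from the
    noise distribution that are dominated by gene g's observed values:
    lower |log fold change| AND lower |expression difference|."""
    noise_m = np.abs(np.asarray(noise_m, dtype=float))
    noise_d = np.abs(np.asarray(noise_d, dtype=float))
    dominated = (noise_m < abs(m_g)) & (noise_d < abs(d_g))
    return dominated.mean()
```

Values close to 1 indicate that the gene's between-condition change exceeds almost everything seen under within-condition "noise" comparisons.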

Finally, the two transformation approaches (the variance stabilizing transformation provided in the DESeq R package and the voom transformation from the limma R package) aim to find a transformation of the counts to make them more amenable to analysis by traditional methods developed for differential expression analysis in the microarray context. The variance-stabilizing transformation provided in the DESeq R package (here denoted ‘vst’) explicitly computes the transformation by assuming an NB distribution and using dispersion estimates obtained as for DESeq. The ‘voom’ transformation from the limma R package essentially log-transforms the normalized counts and uses the mean-variance relationship for the transformed data to compute gene weights, which are then used by limma during the differential expression analysis.
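The core log-transformation step of voom (log2 counts-per-million with small offsets) can be sketched as follows. This is an illustration, not the limma implementation; voom's subsequent mean-variance weighting step is omitted:

```python
import numpy as np

def log_cpm(counts, lib_sizes=None):
    """log2 counts-per-million with the small offsets used by voom:
    log2((count + 0.5) / (library size + 1) * 1e6).
    `counts` has genes in rows and samples in columns."""
    counts = np.asarray(counts, dtype=float)
    if lib_sizes is None:
        lib_sizes = counts.sum(axis=0)  # one library size per sample
    return np.log2((counts + 0.5) / (lib_sizes + 1.0) * 1e6)
```

The offsets guard against taking the logarithm of zero; limma then fits a trend to the mean-variance relationship of these values to derive per-observation weights.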

In the present study, we focus on two-group comparisons only, since this is arguably the most common situation in practice. However, most of the evaluated methods also support more complex experimental designs. Most methods (edgeR, DESeq, NBPSeq, TSPM) achieve this through a generalized linear model (GLM) framework, where the user can specify desired contrasts to test. The limma package offers similarly flexible design options for the transformed data. The Bayesian methods (baySeq and EBSeq) allow the user to provide models defining collections of samples that are supposed to share the same distributional parameters, and return the posterior likelihood of each model thus defined. ShrinkSeq is based on the general framework of Gaussian latent models through the INLA approach.

Parameter choices

Many of the methods that are compared in this paper allow the user to select the values of certain parameters that can affect the results in various ways. We have mostly used the default values provided in the implementations, but in the supplementary material (Additional file

For edgeR, we used the TMM method (Trimmed Mean of M-values) to compute normalization factors.

For DESeq, we computed a pooled estimate of the dispersion parameter for each gene. We used local regression to find the mean-variance relationship and employed the conservative approach of selecting the larger of the fitted value and the individual dispersion estimate for each gene. As for edgeR, we used the implemented exact test to find DE genes. The local regression approach was also used in the variance-stabilizing transformation provided by the DESeq package (denoted ‘vst’), for which we used instead the ‘blind’ option for the dispersion estimation.

Also for TSPM, baySeq, voom and NBPSeq we used the TMM method to compute normalization factors. For NOISeq, we normalized the counts using the TMM method before feeding the data into the differential expression analysis. Furthermore, for NBPSeq we used the ‘NBP’ parametrization of the Negative Binomial distribution. For baySeq, we assumed a Negative Binomial distribution and used the quasi-likelihood approach to estimate priors. We used a sample size of 5,000 to estimate the priors. Furthermore, we assumed equal dispersion for a gene in the two sample groups and used the ‘BIC’ option for the prior re-estimation step. For EBSeq, we used the default ‘median’ normalization method, that is, the normalization provided with DESeq.
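The ‘median’ normalization provided with DESeq (median-of-ratios size factors) can be sketched as follows; this is an illustrative re-implementation of ours, not the DESeq code:

```python
import numpy as np

def deseq_size_factors(counts):
    """Median-of-ratios size factors as in DESeq: for each sample, the
    median over genes of the ratio between the sample's count and the
    per-gene geometric mean across samples. Genes with a zero count in
    any sample are excluded from the median."""
    counts = np.asarray(counts, dtype=float)  # genes in rows, samples in columns
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts)           # -inf marks zero counts
    log_geo_means = log_counts.mean(axis=1)   # -inf if any sample had a zero
    keep = np.isfinite(log_geo_means)
    ratios = log_counts[keep] - log_geo_means[keep, None]
    return np.exp(np.median(ratios, axis=0))
```

Dividing each sample's counts by its size factor puts the samples on a common scale before comparison.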

Before applying ShrinkSeq, we normalized the counts using TMM normalization factors. Within ShrinkSeq we then employed a zero-inflated Negative Binomial distribution, and applied shrinkage to the dispersion parameter as well as the regression coefficient of interest in the inference procedure. To make the results from ShrinkSeq comparable to those from the other methods, we did not impose a non-zero fold change threshold when estimating the false discovery rates.

Data sets

Most of the evaluations in this paper are based on synthetic data, where we could control the settings and the true differential expression status of each gene. We generated the counts for each gene from a Negative Binomial distribution, with mean and dispersion parameters estimated from real RNA-seq data, following the same approach as in

We let G = {g_1, …, g_|G|} denote the set of genes in our data set. In the synthetic data sets, we took |G| = 12,500. Similarly, we let S = {s_1, …, s_|S|} denote the set of samples, and assumed that these were partitioned into two subsets S_1 and S_2. In our experiments, we let |S_1| = |S_2| and we thought of S_1 as the “control” group of samples and S_2 as a group of samples with an abnormal phenotype. We let G^up denote the set of genes simulated to be upregulated in S_2 compared to S_1. Similarly, G^down denotes the set of genes simulated to be downregulated in S_2 compared to S_1.

The random variable representing the count for gene g in sample s was denoted X_gs. It was modeled by a Negative Binomial distribution, parametrized by its mean and dispersion:

E[X_gs] = μ_gs,  Var(X_gs) = μ_gs + φ_gs μ_gs².

Here, φ_gs is the dispersion parameter, controlling the level of overdispersion. Moreover, the mean was taken proportional to the sequencing depth of the sample,

μ_gs = M_s λ_{g,c(s)},

where M_s is the sequencing depth for sample s, which we defined as M_s = 10^7 u_s for a randomly drawn per-sample scaling factor u_s, and where c(s) ∈ {S_1, S_2} denotes the condition for sample s and λ_{g,c(s)} the relative abundance of gene g in that condition. We let the dispersion parameter φ_gs be the same in the two sample groups, that is, φ_gs = φ_g for all s.

For each gene, we drew a pair of values (μ_g, φ_g) from those estimated from the real RNA-seq data. For genes simulated to be DE, we then scaled the mean in condition S_2 relative to condition S_1 by a multiplicative factor of at least γ_g (upwards for upregulated genes and downwards for downregulated genes). The parameter γ_g thus denoted the lower bound on the differential expression between the two groups. In our simulations, we let γ_g = 1.5 for all g.
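Under the mean-dispersion model above (Var = μ + φμ²), counts can be generated with numpy's Negative Binomial sampler by converting to numpy's (n, p) parametrization. The helper below is a sketch of this mapping, not the study's simulation code:

```python
import numpy as np

def simulate_nb_counts(mu, phi, n_samples, rng):
    """Draw Negative Binomial counts for one gene with mean mu and
    dispersion phi, so that Var = mu + phi * mu**2.
    phi = 0 falls back to a Poisson distribution (no overdispersion).
    numpy's parametrization: n = 1/phi, p = n / (n + mu)."""
    if phi == 0:
        return rng.poisson(mu, size=n_samples)
    n = 1.0 / phi
    p = n / (n + mu)
    return rng.negative_binomial(n, p, size=n_samples)
```

With this mapping, E[X] = n(1−p)/p = μ and Var(X) = n(1−p)/p² = μ + φμ², matching the model.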

To simulate different real situations, we also evaluated the effect of generating the counts for half of the genes using a Poisson distribution (i.e., without overdispersion), as well as the effect of including outliers with abnormally high counts.

In all synthetic data sets, the observations were distributed between two conditions (denoted S_1 and S_2), with the same number of observations (2, 5 or 10) in each condition. The first two columns give the number of genes simulated to be upregulated and downregulated, respectively, in S_2 compared to S_1. The number of genes whose counts were drawn from a Poisson distribution (i.e., with the dispersion parameter equal to zero) is given in the column |{g; φ_g = 0}|. The ‘single’ outlier fraction denotes the fraction of the genes for which we selected a single sample and multiplied the corresponding count with a factor between 5 and 10. The ‘random’ outlier fraction denotes the fraction of counts that were selected randomly (among all counts) and multiplied with a factor between 5 and 10.

| **Upregulated genes** | **Downregulated genes** | **\|{g; φ_g = 0}\|** | **‘Single’ outlier fraction** | **‘Random’ outlier fraction** |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1,250 | 0 | 0 | 0 | 0 |
| 625 | 625 | 0 | 0 | 0 |
| 4,000 | 0 | 0 | 0 | 0 |
| 2,000 | 2,000 | 0 | 0 | 0 |
| 0 | 0 | 6,250 | 0 | 0 |
| 625 | 625 | 6,250 | 0 | 0 |
| 0 | 0 | 0 | 10% | 0 |
| 625 | 625 | 0 | 10% | 0 |
| 0 | 0 | 0 | 0 | 5% |
| 625 | 625 | 0 | 0 | 5% |

In addition to the synthetic data, we also considered a real RNA-seq data set

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

CS and MD contributed to the design of the study, the interpretation of the results and the writing of the manuscript. CS performed the implementation and the numerical experiments. Both authors read and approved the final manuscript.