CSIRO Plant Industry, Black Mountain Laboratories, Canberra, Australia

Mathematical Sciences Institute, Australian National University, Canberra, Australia

Prince of Wales Clinical School and School of Mathematics and Statistics, University of New South Wales, Sydney, Australia

Abstract

Background

RNA sequencing (RNA-Seq) has emerged as a powerful approach for the detection of differential gene expression, with both high-throughput and high-resolution capabilities possible depending upon the experimental design chosen. Multiplex experimental designs are now readily available; these can be utilised to increase the number of samples or replicates profiled at the cost of decreased sequencing depth per sample. These strategies affect the power of the approach to accurately identify differential expression. This study presents a detailed analysis of the power to detect differential expression in a range of scenarios including simulated null and differential expression distributions with varying numbers of biological or technical replicates, sequencing depths and analysis methods.

Results

Differential and non-differential expression datasets were simulated using a combination of negative binomial and exponential distributions derived from real RNA-Seq data. These datasets were used to evaluate the performance of three commonly used differential expression analysis algorithms and to quantify the changes in power with respect to true and false positive rates when simulating variations in sequencing depth, biological replication and multiplex experimental design choices.

Conclusions

This work quantitatively explores comparisons between contemporary analysis tools and experimental design choices for the detection of differential expression using RNA-Seq. We found that the DESeq algorithm performs more conservatively than edgeR and NBPSeq. With regard to testing of various experimental designs, this work strongly suggests that greater power is gained through the use of biological replicates relative to library (technical) replicates and sequencing depth. Strikingly, sequencing depth could be reduced to as low as 15% without substantial impacts on false positive or true positive rates.

Background

RNA sequencing (RNA-Seq) allows an entire transcriptome to be surveyed at single-base resolution whilst concurrently profiling gene expression levels on a genome scale

Arguably, the most popular use of RNA-Seq is profiling differences in gene expression or transcript abundance between samples, that is, differential expression (DE). The efficiency, resolution and cost advantages of using RNA-Seq as a tool for profiling DE have prompted many biologists to abandon microarrays in favour of RNA-Seq.

Despite the advantages of using RNA-Seq for DE analysis, there are several sources of sequencing bias and systematic noise that need to be considered when using this approach. Clearly, RNA-Seq analysis is vulnerable to the general biases and errors inherent in the next-generation sequencing (NGS) technology upon which it is based. These errors and biases include: sequencing errors (wrong base calls), biases in sequence quality, nucleotide composition and error rates relative to the base position in the read

Recently, there have been several investigations

Despite the known biases, RNA-Seq continues to be widely and successfully used to profile relative transcript abundances across samples to identify differentially expressed transcripts

Good experimental design and appropriate analysis are integral to maximising the power of any NGS study. With regard to RNA-Seq, important experimental design decisions include the choice of sequencing depth and the number of technical and/or biological replicates to use. For researchers with a fixed budget, a critical design question is often whether to increase the sequencing depth at the cost of reduced sample numbers or to increase the sample size with limited sequencing depth for each sample.

Sequencing depth

Sequencing depth is usually defined as the expected mean coverage at all loci over the target sequence(s); in the case of RNA-Seq experiments, this assumes that all transcripts have similar levels of expression. Without the benefit of extensive previous RNA-Seq studies, it is difficult in most cases to estimate, prior to data generation, the optimal sequencing depth or amount of sequencing data required to adequately power the detection of DE in the transcriptome of interest. Pragmatically, RNA-Seq sequencing depth is typically chosen based on an estimate of total transcriptome length (in bases) and the expected dynamic range of transcript abundances. Given the dynamic nature of the transcriptome, the suitability of these estimates could vary substantially across organisms, tissues, time points and biological contexts.

Wang et al.

Replication

Replication is vital for robust statistical inference of DE. In the context of RNA sequencing, multiple nested levels of technical replication exist depending upon whether it is the sequence data generation, library preparation or RNA extraction technical processes that are being replicated from the same biological sample. Several published studies have incorporated technical replicates into their RNA-Seq experimental designs

It has been shown that power to detect DE improves when the number of biological replicates

Efficient experimental design

Multiplexing is an increasingly popular approach that allows the sequencing of multiple samples in a single sequencing lane or reaction and consequently the reduction in sequencing costs per sample

Approach

Improving detection of DE requires not only an appropriate experimental design but also a suitably powered analysis approach. Several algorithms have recently been developed specifically to appropriately handle expected technical and biological variation arising from RNA-Seq experiments. A non-exhaustive list of these algorithms is: edgeR

To quantify the effects of different sequencing depths and replication choices we compared a range of realistic experimental designs for their ability to robustly detect DE. Using simulated data with known DE transcripts allowed us to estimate the false positive rate (FPR) and true positive rate (TPR) of DE calls. The changes of these rates were used to compare the detection power yielded by each choice of number of biological replicates and sequencing depth.

In the Methods section, we outline the definitions used for FPR and TPR and explain how the synthetic data were constructed, including how differential expression was induced and how the variation introduced by biological replicates and the loss of sequencing depth were simulated.

In our study, we test a wide range of real-world experimental design scenarios for performance under the null hypothesis and in the presence of DE. In these scenarios both the numbers of biological replicates

Results

Comparisons of statistical methods: edgeR, DESeq, and NBPSeq using simulated data under the null

To test the performance of each package under the null hypothesis, we simulated sets of

**Percentage of transcripts reported as differentially expressed (FPR, as defined by Eq. 4) by three software packages for synthetic data generated under the null hypothesis of no DE between two conditions.** In the lower two panels the set of transcripts has been divided into those with greater than 100 counts (DE-high) and those with less than or equal to 100 counts (DE-low), averaged over biological replicates. The number of biological replicates in each condition was varied over the range

**Histograms of p-values calculated by three software packages for one particular example of synthetic data generated under the null hypothesis for the case n = 3.** In the two right-hand columns the set of transcripts has been divided into high-count transcripts (> 100 counts) and low-count transcripts (≤ 100 counts) respectively. ‘Percentage of total’ is the percentage of p-values falling within each of 100 bins in each histogram.

Immediately noticeable in the p-value histogram is a sharp spike in the right hand bin for low count transcripts, which is observed to be present in general for all values of _{
i
} for each transcript

The package edgeR performs well for large numbers of biological replicates (

In an effort to be conservative, DESeq chooses as its estimate of dispersion the maximum of a per-transcript estimate and the functional form Eq. 2 which is fitted to the per-transcript estimates for all transcripts. Our results indicate that the method performs well for the high-count transcripts when the number of biological replicates is small (

The package NBPSeq imposes the functional relationship Eq. 3, which appears to be too restrictive for a number of relatively highly dispersed transcripts. For those transcripts the dispersion parameter is underestimated, leading to an overestimate of significance and hence an inflated FPR irrespective of the number of biological replicates.

Based on these results we selected DESeq (v1.6.1) and edgeR (v2.4.0) for use in subsequent experimental design testing. Throughout these tests, results obtained using DESeq and edgeR are mostly compatible with each other. However, our comparison revealed a slightly inflated FPR from edgeR while DESeq behaves more conservatively throughout. Therefore, in the following section we will focus on the results obtained using DESeq while the analogous results obtained with edgeR are presented in the Additional file

**Figure S2.** FPR and TPR detected by edgeR as a function of sequencing depth and replication. Different symbols represent the number **A:** TPR at p_{adj} ≤ 0.01. **B:** FPR at p_{adj} ≤ 0.01. The solid grey line (“multiplex line”) connecting the TPR values of


Comparison of statistical methods: DESeq and edgeR using simulated data with 15% DE transcripts

To test the performance of packages in the presence of an alternate hypothesis, we simulated sets of

Detection of DE as a function of number of biological replicates

With an increase in replication we saw a steady increase in the percentage of DE calls made by DESeq (call rate), increasing from 0.44% to 5.12% as

Effects of biological replication on power to detect DE using DESeq. FPR and TPR are defined in Eqs. 5 and 6 respectively and are evaluated at the 1% level. The “call rate” is the total number of reported positives divided by the total number of transcripts. These values are also represented in Figure

| **%** | **n =** | **n =** | **n =** | **n =** | **n =** | **n =** |
|---|---|---|---|---|---|---|
| call rate % | 0.44 | 1.15 | 1.76 | 3.03 | 4.08 | 5.12 |
| FPR % | 0.04 | 0.06 | 0.06 | 0.06 | 0.05 | 0.04 |
| TPR % | 3.26 | 8.95 | 13.95 | 24.30 | 32.72 | 41.57 |
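As a rough consistency check on the values reported above, the call rate can be rebuilt from the TPR and FPR. The sketch below assumes that every reported positive is either a true positive among the “effectively DE” transcripts or a false positive among the “effectively non-DE” transcripts, and it takes the effectively DE fraction (5,726 of 46,446) from the Methods. The function name `expected_call_rate` is illustrative only.

```python
# Values transcribed from the table of replication effects (all in percent,
# one entry per replicate column, in the order reported).
tpr = [3.26, 8.95, 13.95, 24.30, 32.72, 41.57]
fpr = [0.04, 0.06, 0.06, 0.06, 0.05, 0.04]
call_rate = [0.44, 1.15, 1.76, 3.03, 4.08, 5.12]

# Fraction of "effectively DE" transcripts stated in the Methods.
frac_de = 5726 / 46446


def expected_call_rate(tpr_pct, fpr_pct, frac_de):
    """Call rate implied by TPR and FPR (assumed decomposition, in percent)."""
    return tpr_pct * frac_de + fpr_pct * (1.0 - frac_de)


expected = [expected_call_rate(t, f, frac_de) for t, f in zip(tpr, fpr)]
for reported, rebuilt in zip(call_rate, expected):
    print(f"reported {reported:.2f}%   rebuilt {rebuilt:.2f}%")
```

Under this assumption the rebuilt call rates track the reported ones closely, which suggests that the growth in call rate with replication is driven almost entirely by the growth in TPR.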

Detection of DE as a function of sequencing depth

Figure

**TPR and FPR detected by DESeq as a function of sequencing depth and replication.** Different symbols represent the number **A:** TPR (Eq. 6) at p_{adj} ≤ 0.01. **B:** FPR (Eq. 5) at p_{adj} ≤ 0.01. The solid grey line (“multiplex line”) connecting the TPR values of

Table

Effects of sequencing depth on FPR values (%) for a subset of the tested sequencing depths (25%, 50%, 75% and 100%).

| **Depth** | **n =** | **n =** | **n =** | **n =** | **n =** | **n =** |
|---|---|---|---|---|---|---|
| 25% | 0.02 | 0.02 | 0.04 | 0.03 | 0.03 | 0.03 |
| 50% | 0.03 | 0.03 | 0.04 | 0.05 | 0.04 | 0.03 |
| 75% | 0.04 | 0.06 | 0.05 | 0.07 | 0.04 | 0.04 |
| 100% | 0.04 | 0.06 | 0.06 | 0.06 | 0.05 | 0.04 |

Effects of sequencing depth on TPR values (%) for a subset of the tested sequencing depths (25%, 50%, 75% and 100%).

| **Depth** | **n =** | **n =** | **n =** | **n =** | **n =** | **n =** |
|---|---|---|---|---|---|---|
| 25% | 1.57 | 6.24 | 10.40 | 19.18 | 26.08 | 35.41 |
| 50% | 2.58 | 7.63 | 12.40 | 22.34 | 29.66 | 39.16 |
| 75% | 3.01 | 8.47 | 13.16 | 23.44 | 31.57 | 40.65 |
| 100% | 3.26 | 8.95 | 13.95 | 24.30 | 32.72 | 41.57 |

Detection of DE across multiplex experimental design strategies

We simulated various scenarios of multiplexing

**Same as Figure but using 2-fold changes as the criterion for FPR and TPR instead of p_{adj} ≤ 0.01.** **A:** TPR, fold-change ≥ 2. **B:** FPR, fold-change ≥ 2. The “multiplex line” connects the TPR and FPR values of

The multiplex line in Figure

Fold-changes as indicators of DE

It is common practice among biologists to use fold-change, rather than p-values, as an indicator of DE. Figure

Discussion

Comparisons of DE algorithms: edgeR, DESeq and NBPSeq

Our comparison of these three DE detection algorithms under the null hypothesis revealed different performances (measured by their FPR) when different numbers of biological replicates

This comparison led us to use both DESeq and edgeR throughout our replication and sequencing depth simulations. We ultimately chose DESeq’s results^{a} as this package behaved slightly more conservatively and appeared less sensitive to changes in replication (see Figure
^{b}. However in no instance do we obtain a FPR larger than 1% for DESeq (2% for edgeR) – (see Figures

Effects of replication for detection of DE

To quantify the effects of replication in RNA-Seq DE experiments, we tested

Our results clearly support the simple message that more biological replicates are not only desirable but needed to improve the quality and reliability of DE detection using RNA-Seq. However, due to the costs associated with RNA-Seq, many experiments are likely to need multiplex designs to achieve this level of replication.

This study is concerned with the simulation of overdispersion effects due to biological variability and it is implied that overdispersion due to technical variability is nested within this estimation (see Methods section). It is worth mentioning that, while biological variability is important, the contribution to overdispersion by technical variation is not negligible, and disagreements between estimates of expression can occur at all levels of coverage

Effects of sequencing depth for detection of DE

To quantify the effects of sequencing depth in RNA-Seq DE experiments, we simulated an extensive sequencing depth range (100% to 1%) for every case of

We conclude that DE analysis with RNA-Seq is robust to substantial loss of sequencing data as indicated by a slow decline in TPR as sequencing depth is lost accompanied by no increase in FPR. These findings seem consistent with the results reported by Bashir et al.

Multiplexing experimental designs

To quantify the effects of varying both

Our simulations strongly support that the benefits of multiplexing

While the detection of DE appears robust to available sequence data, there remains the question of how multiplexing affects coverage of the transcriptome and detection of low abundant or rare transcripts. This coverage issue will increasingly be counterbalanced by rapid increases in data generation capacity from a single sequencing experiment. In a detailed study of the Marioni

Conclusions

Not surprisingly, our results indicate that more biological replicates are needed to improve the quality and reliability of DE detection using RNA-Seq. Importantly however, we also find that DE analysis with RNA-Seq is robust to substantial loss of sequencing data as indicated by a slow decline in TPR accompanied by no increase in FPR. Our simulations strongly support that multiplexing experimental designs improve TPR and FPR while greatly reducing the cost of the experiment, as the benefits of multiplexing n-biological replicates far outweigh the decrease of available data per sample by 1/

As the available packages for DE analysis are becoming faster and easier to use, our recommendation for most RNA-Seq DE experiments is to use two different packages for DE testing. Additional file

**Figure S4.** Venn-diagram showing the TP and FP calls made by DESeq (left, blue circle) and edgeR (right, red circle) and how they overlap between each other and the total pool of transcripts designated as truly DE (top, green circle). **A:** the Venn-diagram for the case in which the number of biological replicates is **B:** the Venn-diagram for


To our knowledge, this is the most up-to-date comparison of the ability of DESeq and edgeR to detect DE across a range of experimental designs. It directly tests the efficiency of modern multiplex experimental design strategies. Our study informs important experimental design decisions that are now relevant when trying to maximise the power of an RNA-Seq study to reliably detect DE.

Methods

Negative binomial model and biological variation simulation

Our synthetic data is based on a negative binomial (NB) model of read counts assumed by

The mean

**Negative binomial model**^{c}**.**


R packages for DE in RNA-Seq

All three packages considered are based on a NB model, and differ principally in the way the dispersion parameter is estimated. Unless otherwise stated, tests of these packages used herein use default settings. Typical coding sequences are given in the Additional file

edgeR (version 2.4.0, Bioconductor)

To begin with, edgeR

DESeq (version 1.6.1, Bioconductor)

In previous versions of the package DESeq

using a gamma-family generalised linear model. The per-transcript estimate is considered to be more appropriate when large numbers of replicates (≥ 4) are present, while the functional form is considered to be more appropriate when small numbers of replicates (≤ 2) are present, in which case information is borrowed from the general trend of all transcripts. Recognising that the dispersion may be underestimated by the functional fit, leading to an overestimate of significance in detecting DE, DESeq by default chooses the maximum of the two methods for each transcript. Also by default, DESeq assumes a model in which the mean

NBPSeq (version 0.1.4, CRAN)

As for edgeR, the package NBPSeq

that is, a linear relationship between log

Construction of the synthetic datasets

Each of our synthetic datasets consists of a ‘control’ dataset of read counts

For each transcript isoform, we begin by providing a pair of NB parameters
_{
i
}= 1,…,

The basis for the parameters
^{6} to 20 × 10^{6}. To provide a uniform set of biological replicates from which to estimate
^{6} to 16 × 10^{6} was chosen. Finally, any transcript for which the total number of reads was less than 44, i.e. an average of less than one transcript per lane, was culled from the dataset to leave a list of 46,446 transcripts. The resulting subset of the Pickrell dataset is considered to exhibit overdispersion due to both library preparation and biological variation.

Note that for generation of synthetic data it is not necessary to provide an accurate estimate of the two NB parameters for each isoform in the reduced Pickrell dataset, but simply to provide a plausible distribution of values of these parameters over the transcriptome, representing typical isoform abundances and their variation due to technical and/or biological overdispersion. Parameter values

For each transcript a maximum likelihood estimate (MLE)

Two sets of simulations were performed:

1. To test performance under the null hypothesis, the regulating factor was set to 1 for all transcripts.

2. To test ability to detect DE in the presence of an alternative hypothesis, the regulating factor was set to 1 plus an exponential random variate for a randomly chosen 7.5% of the transcripts (up-regulated), to the reciprocal of 1 plus an exponential random variate for a further 7.5% (down-regulated), and to 1 for the remaining 85% of the transcripts, where the exponential random variates are independently and identically distributed with mean 1.
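The second simulation scheme can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' simulation code. It assumes that the treatment mean of each transcript is the control mean multiplied by its regulating factor, and that the NB variance is mean + dispersion × mean², a common parametrisation. The function names and the placeholder control means and dispersions are hypothetical and are not the values fitted to the Pickrell data.

```python
import numpy as np


def simulate_regulating_factors(n_transcripts, rng, frac_up=0.075, frac_down=0.075):
    """Assign a regulating factor to each transcript (1 means no DE)."""
    factors = np.ones(n_transcripts)
    order = rng.permutation(n_transcripts)
    n_up = int(round(frac_up * n_transcripts))
    n_down = int(round(frac_down * n_transcripts))
    up = order[:n_up]
    down = order[n_up:n_up + n_down]
    # Up-regulated: 1 + Exp(mean 1); down-regulated: reciprocal of 1 + Exp(mean 1).
    factors[up] = 1.0 + rng.exponential(scale=1.0, size=n_up)
    factors[down] = 1.0 / (1.0 + rng.exponential(scale=1.0, size=n_down))
    return factors


def sample_nb_counts(mean, dispersion, n_replicates, rng):
    """Draw NB counts with variance = mean + dispersion * mean**2 (assumed form)."""
    size = 1.0 / dispersion            # NumPy's "n" parameter
    prob = size / (size + mean)        # NumPy's "p" parameter
    return rng.negative_binomial(size[:, None], prob[:, None],
                                 size=(mean.size, n_replicates))


rng = np.random.default_rng(seed=1)
n_transcripts = 46446
factors = simulate_regulating_factors(n_transcripts, rng)

# Placeholder control parameters (hypothetical, for illustration only).
control_mean = rng.lognormal(mean=4.0, sigma=1.5, size=n_transcripts)
dispersion = np.full(n_transcripts, 0.2)

control = sample_nb_counts(control_mean, dispersion, n_replicates=3, rng=rng)
treatment = sample_nb_counts(factors * control_mean, dispersion, n_replicates=3, rng=rng)
print("transcripts with a factor other than 1:", int((factors != 1.0).sum()))
```

Setting `frac_up` and `frac_down` to zero reproduces the null scheme, in which every regulating factor equals 1.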

Calculation of true and false positive rates

Under the null hypothesis

All three packages test for DE in single-factor experiments by calculating p-values using the method described in

To test the performance of each package under the null hypothesis, we simulated sets of

Ideally, the FPR should match the significance level of

In the presence of an alternative hypothesis

All three packages provide an adjusted p-value, p_{adj}, to correct for multiple hypothesis testing with the Benjamini-Hochberg procedure using the R function p.adjust(). All calculations herein of true and false positive rates in the presence of an alternative hypothesis use adjusted p-values.
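For readers unfamiliar with the adjustment, the Benjamini-Hochberg step-up procedure can be written compactly. The function below is a NumPy sketch of the standard procedure, not code taken from any of the three packages. The name `benjamini_hochberg` is illustrative, and missing values are not handled.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # ascending p-values
    scaled = p[order] * m / np.arange(1, m + 1)    # p_(k) * m / k
    # Enforce monotonicity, working back from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted


example = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(example)
```

A transcript is then called DE when its adjusted value falls at or below the chosen threshold, which is 0.01 throughout this study.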

From the 6,966/46,446 (15%) of the transcripts induced with a regulating factor other than 1, we selected the 5,726 (12%) with a regulating factor of at most 0.83 or at least 1.20. We define these as “effectively DE” transcripts. This additional filter on minimal fold-change is designed to quantify the performance of algorithms and experimental designs for detection of DE that might be considered more biologically relevant by researchers. Likewise we define the remaining transcripts, those with a regulating factor strictly between 0.83 and 1.20, as “effectively non-DE”. These definitions were used to estimate the FPR and TPR at significance level

Apart from the use of adjusted p-values, the formula for FPR reduces to Eq. 4 if the number of simulated DE transcripts is set to zero, since in this case all transcripts are, by definition, “effectively non-DE”. The quantities 1−FPR and TPR are commonly referred to in the literature as “specificity” and “sensitivity” respectively.
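The definitions above can be illustrated with a short sketch. It assumes the usual specificity and sensitivity reading of the two rates: FPR is the fraction of effectively non-DE transcripts called DE, and TPR is the fraction of effectively DE transcripts called DE, both at an adjusted p-value threshold. The function names are hypothetical, and the thresholds 0.83 and 1.20 are those quoted in the text.

```python
import numpy as np


def classify_effectively_de(factors, lower=0.83, upper=1.20):
    """Flag transcripts whose regulating factor is <= lower or >= upper."""
    factors = np.asarray(factors, dtype=float)
    return (factors <= lower) | (factors >= upper)


def fpr_tpr(p_adj, effectively_de, alpha=0.01):
    """FPR and TPR of DE calls at p_adj <= alpha (assumed definitions)."""
    called = np.asarray(p_adj, dtype=float) <= alpha
    de = np.asarray(effectively_de, dtype=bool)
    fpr = float(called[~de].mean()) if (~de).any() else float("nan")
    tpr = float(called[de].mean()) if de.any() else float("nan")
    return fpr, tpr


# Toy example with five transcripts (made-up values).
toy_factors = [1.0, 1.5, 0.5, 1.1, 2.0]
toy_p_adj = [0.5, 0.001, 0.2, 0.005, 0.009]
toy_de = classify_effectively_de(toy_factors)
toy_fpr, toy_tpr = fpr_tpr(toy_p_adj, toy_de)
print(f"FPR = {toy_fpr:.3f}, TPR = {toy_tpr:.3f}")
```

In the toy example the transcript with factor 1.1 is called DE but counts as a false positive, because its fold-change falls inside the effectively non-DE band.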

Simulating variable levels of sequence data and replication

Simulating variations in available sequencing data is a fundamental part of investigating the impacts of multiplex experimental design strategies. Variability in the amount of sequence data amongst samples can occur for reasons such as restrictions on available resources, machine error, or sequencing reads sequestered by pathogen transcriptome fractions present in the sample. To simulate loss of sequencing depth, we randomly sub-sampled without replacement counts from the original table of counts simulated in the presence of an alternative hypothesis for each biological replicate. Sequencing depth was decreased in both control and treatment samples over a range of 100% (a full lane of sequence) to 1% of the original data. After this sub-sampling, the resulting table of counts was analysed in DESeq (edgeR) and the total number of effectively-DE calls, TPR, FPR and fold-changes were recorded for every
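Sub-sampling reads without replacement from a table of counts is equivalent to drawing from a multivariate hypergeometric distribution, one draw per sample column. The sketch below uses NumPy's `Generator.multivariate_hypergeometric` (available from NumPy 1.18) to show the idea. It is not the authors' implementation, and the function name `subsample_counts` and the toy count table are hypothetical.

```python
import numpy as np


def subsample_counts(counts, fraction, rng):
    """Keep roughly `fraction` of the reads in each column, without replacement."""
    counts = np.asarray(counts, dtype=np.int64)
    out = np.empty_like(counts)
    for j in range(counts.shape[1]):
        column = counts[:, j]
        n_keep = int(round(fraction * int(column.sum())))
        # Each read is equally likely to be retained, so the retained counts
        # per transcript follow a multivariate hypergeometric distribution.
        out[:, j] = rng.multivariate_hypergeometric(column, n_keep)
    return out


rng = np.random.default_rng(seed=7)
# Toy table: 4 transcripts (rows) by 2 samples (columns).
toy_counts = np.array([[100, 40], [0, 60], [300, 20], [0, 0]])
quarter_depth = subsample_counts(toy_counts, 0.25, rng)
print(quarter_depth)
```

Because the draw is without replacement, a sub-sampled count can never exceed the original count, and transcripts with zero reads stay at zero.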

Multiplexing experimental designs

Multiplexing various samples into one sequencing lane reduces the monetary cost of RNA-Seq DE analysis, albeit by dividing the available sequencing depth over various samples. Our strategy consisted of simulating multiplexing

• 2 vs. 2 biological replicates at 50% sequencing depth

• 3 vs. 3 biological replicates at 33% sequencing depth

• 4 vs. 4 biological replicates at 25% sequencing depth

• 6 vs. 6 biological replicates at 17% sequencing depth

• 8 vs. 8 biological replicates at 13% sequencing depth

• 12 vs. 12 biological replicates at 8% sequencing depth

• 32 vs. 32 biological replicates at 3% sequencing depth

• 96 vs. 96 biological replicates at 1% sequencing depth
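The depths in this list are consistent with dividing one lane equally among the n biological replicates of a condition, so that each sample receives about 100%/n of a full lane. The short sketch below reproduces the listed percentages under that assumption; the helper names are illustrative.

```python
def multiplex_depth_percent(n_replicates):
    """Per-sample depth (percent of a lane) if n samples share one lane equally."""
    return 100.0 / n_replicates


def round_half_up(x):
    """Round a positive number to the nearest integer, with halves rounded up."""
    return int(x + 0.5)


designs = [2, 3, 4, 6, 8, 12, 32, 96]
depths = [round_half_up(multiplex_depth_percent(n)) for n in designs]
for n, depth in zip(designs, depths):
    print(f"{n} vs. {n} biological replicates at {depth}% sequencing depth")
```

Halves are rounded up explicitly because Python's built-in `round` would turn 12.5 into 12, whereas the list gives 13% for the 8 vs. 8 design.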

Endnotes

^{a}Our results obtained using edgeR are presented in the Additional file

^{b}Additional file
p_{adj} ≤ 0.01 for every

^{c}The details of our negative binomial model can be found in Additional file

**Figure S3.** Smallest fold-change required for a transcript to be called DE (p_{adj} ≤ 0.01) as a function of


**Figure S1.** Maximum likelihood estimates of the NB mean


Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

All authors contributed to the manuscript. JAR, SEQ and SJS performed the statistical and bioinformatic analysis. CJB developed the data simulation algorithm. CJB, SRW and JMT conceived and designed the study. All authors read and approved the final manuscript.

Acknowledgements

JAR is funded by the OCE Science team. This project is supported by Australian Research Council Discovery Grant DP1094699. We used