School of Mathematics and Statistics, University of Sydney, Sydney NSW 2006, Australia

CSIRO Mathematical and Information Sciences, Private Bag 33, Clayton South 3168, Australia

Department of Biomedical Sciences, Cornell University, Ithaca, NY, USA

Abstract

Background

The cost of RNA-Seq has been decreasing over the last few years. Despite this, experiments with four or fewer biological replicates are still quite common. Estimating the variances of gene expression estimates is both challenging and interesting in these situations of low replication. However, the wealth of microarray and other gene expression data readily accessible in public repositories can be leveraged to improve variance estimation.

Results

We have proposed a novel approach called Tshrink+ for inferring differential gene expression through improved modelling of the gene-wise variances. Existing methods share information between genes of similar average expression by shrinking, or moderating, the gene-wise variances to a fitted common variance. We have been able to achieve improved estimation of the common variance by using gene-wise sample variances from external experiments, as well as gene length.

Conclusions

Using biological data we show that utilising additional external information can improve the modelling of the common variance and hence the calling of differentially expressed genes. These sources of additional information include gene length and gene-wise sample variances from other RNA-Seq and microarray datasets, of both related and seemingly unrelated tissue types. The results of this are promising, with our differential expression test, Tshrink+, performing favourably when compared to existing methods such as DESeq and edgeR when considering both gene ranking and sensitivity. These improved variance models could easily be implemented in both DESeq and edgeR and highlight the need for a database that offers a profile of gene variances over a range of tissue types and organisms.

Background

In the post-genomic era, the development of technologies for sequencing the genome and transcriptome has become a key issue in the global analysis of biological systems. Even with the lowering cost of sequencing data, the majority of RNA-Seq experiments still suffer from low replication numbers. The identification of differentially expressed (DE) genes and transcripts is still a key question of interest in many biological studies. To date, there are many methods that provide a test of whether a gene is DE or not.

DESeq and edgeR account for the heteroscedasticity observed in the read counts of genes by modelling the relationship between expected value of the count and its variability. We propose using additional information, such as gene length and variance estimates from external datasets, as explanatory variables to further model the heterogeneity seen in the observed gene variances. Combining these improved models of gene variance with a moderation method

RNA-Seq

The development of high throughput sequencing technologies has made it possible to sequence the transcriptome at a much higher resolution and coverage than was previously available. Sequencing of cDNA samples (RNA-Seq) has a dynamic range larger than that of microarrays

A typical RNA-Seq data analysis workflow consists of many steps. These steps generally consist of mapping, summarisation, normalisation, differential expression analysis and systems biology.

Let _{ij }^{th }^{th }_{i }

An over-dispersed Poisson, a discrete distribution with dispersion greater than a Poisson, can be modelled using a Negative Binomial. A negative binomial random variable,

This standard formulation is generally referred to as NB2. Under this formulation, the biological variability of the expression of a gene is modelled as a quadratic function of its mean expression

where as ^{2 }where

and ^{2 }>

where ^{2 }highlights that ^{2 }should always be greater than or equal to
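As a reference point for the NB2 discussion above, the following minimal sketch writes out the standard NB2 mean-variance relationship, variance = mean + dispersion × mean². The names `mu` (mean) and `phi` (dispersion) are assumptions chosen for illustration, since the original notation is not recoverable from the text here. The conversion to a (size, prob) parameterisation is the textbook one and is included only to show that the two forms agree.

```python
def nb2_variance(mu, phi):
    """Variance of an NB2 count with mean mu and dispersion phi (phi >= 0).

    With phi = 0 this reduces to the Poisson variance, mu.
    """
    return mu + phi * mu ** 2


def nb2_to_size_prob(mu, phi):
    """Convert (mu, phi) to the (size, prob) parameterisation.

    size = 1 / phi and prob = size / (size + mu); requires phi > 0.
    """
    size = 1.0 / phi
    prob = size / (size + mu)
    return size, prob


# Illustrative values only (not taken from the paper).
mu, phi = 100.0, 0.1
size, prob = nb2_to_size_prob(mu, phi)
# In the (size, prob) form the variance is size * (1 - prob) / prob**2,
# which should match mu + phi * mu**2 up to rounding.
print(nb2_variance(mu, phi), size * (1 - prob) / prob ** 2)
```

Because the quadratic term is non-negative, the variance under this formulation can never fall below the mean.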

In current RNA-Seq experiments it is still quite common to see experiments with very little biological replication. Estimating variances from few observations is unstable

Heterogeneous gene variances

It is well accepted that some genes have a higher variance than other genes

We incorporate external data from RNA-Seq and microarrays on mouse striatum and RNA-Seq data from different tissues to better estimate variances and hence identify DE genes between C57BL/6J and DBA/2J mouse striatum samples.

Methods

Tshrink+

We propose using local regression _{(1)}, γ_{(2) }...) that may help to explain the observed pooled sample variances

When using variance estimates from other RNA-Seq experiments, these variances will also have a very strong mean-variance relationship. For use as an explanatory variable we normalise the external variance estimates in such a way that they have mean zero and variance one for all ranges of expression.
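One way this normalisation could be carried out is sketched below. The text does not specify the procedure, so the binning approach, the function name `standardise_within_bins` and the `n_bins` parameter are all assumptions: genes are ordered by mean expression, split into bins of near-equal size, and the external variance estimates are z-scored within each bin.

```python
from statistics import mean, pstdev


def standardise_within_bins(mean_expr, ext_var, n_bins=10):
    """Z-score external variance estimates within bins of similar expression.

    Genes are ordered by mean expression and split into n_bins groups of
    near-equal size. Within each group the external variances are centred
    and scaled, so every expression range has mean zero and unit variance.
    (Hypothetical helper; binning is an assumed choice, not the paper's.)
    """
    if len(mean_expr) != len(ext_var):
        raise ValueError("inputs must have the same length")
    n = len(mean_expr)
    order = sorted(range(n), key=lambda g: mean_expr[g])
    z = [0.0] * n
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:
            continue
        values = [ext_var[g] for g in idx]
        centre, spread = mean(values), pstdev(values)
        for g in idx:
            # A bin with no spread carries no information, so map it to zero.
            z[g] = (ext_var[g] - centre) / spread if spread > 0 else 0.0
    return z


# Toy data with a strong mean-variance trend (illustrative values only).
expr = [float(i) for i in range(20)]
ext = [2.0 * m + (i % 3) for i, m in enumerate(expr)]
z_scores = standardise_within_bins(expr, ext, n_bins=2)
```

A smooth alternative, such as subtracting a local regression fit and dividing by a locally estimated scale, would serve the same purpose; callers may also prefer to pass log-transformed variances.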

To illustrate how this improved common variance can aid in moderation we propose using a quasi-empirical Bayes moderation method

and

A Wald test for each gene is then performed using the statistic

where we utilise the Welch-Satterthwaite equation _{common }_{k }as

where _{gene }_{shrink}
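For readers unfamiliar with the Welch-Satterthwaite equation mentioned above, the sketch below implements its generic form for a weighted sum of independent variance estimates. How the gene-wise and common variance components are weighted in Tshrink+ is not recoverable from the text here, so the function name, arguments and example values are assumptions for illustration only.

```python
def welch_satterthwaite_df(variances, dfs, weights=None):
    """Approximate degrees of freedom for a weighted sum of variance estimates.

    For V = sum_i w_i * s_i^2, where s_i^2 has nu_i degrees of freedom:
        nu ~= (sum_i w_i * s_i^2)^2 / sum_i ((w_i * s_i^2)^2 / nu_i)
    """
    if weights is None:
        weights = [1.0] * len(variances)
    terms = [w * v for w, v in zip(weights, variances)]
    numerator = sum(terms) ** 2
    denominator = sum(t ** 2 / nu for t, nu in zip(terms, dfs))
    return numerator / denominator


# Two equally weighted components with equal variance and 3 df each
# (illustrative values only).
df_combined = welch_satterthwaite_df([2.0, 2.0], [3, 3])
```

Combining a gene-wise estimate with a second, more stable component raises the effective degrees of freedom above those of the gene-wise estimate alone, which is the mechanism by which moderation can increase sensitivity.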

Data

Bottomly dataset

The Bottomly data

**Additional file 1 includes a description of the normalisation used in the evaluation and additional figures**.

**Effect of utilising different sources of information on the estimation of λ**. Variance estimates from the external datasets (Table 1) and gene length are used to aid in the estimation of the common variance functions of one hundred comparisons of

**Comparing six DE methods on a 4 vs 4 comparison**. One hundred random comparisons of four B6 and four D2 mouse striatum samples for six DE methods. Average TP and FP are calculated for the full range of p-value cut-offs. The TPR and FPR are plotted against each other in a) to form ROC curves and displayed in the region for FPR less than 0.01 as this is most relevant for calling DE. For any given FPR a method with a larger TPR is deemed to have ranked the genes better. In b) the number of TP (in bold) and FP are plotted for a range of p-value cut-offs. The x-axis is in log-scale. The grey dashed vertical line corresponds to a Bonferroni adjusted cut-off of 0.05.

External datasets

Sample variances from three datasets were used as sources of additional information to aid in the estimation of the common variance. These are described in Table 1.

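Table 1. Additional information sources.

| **Species** | **Tissue** | **Replicates** | **Platform** | **Source** | **GEO accession** |
|---|---|---|---|---|---|
| C57BL/6J mouse | Liver | 6 | RNA-Seq | Keane | GSE30617 |
| C57BL/6J mouse | Spleen | 6 | RNA-Seq | Keane | GSE30617 |
| C57BL/6J mouse | Thymus | 6 | RNA-Seq | Keane | GSE30617 |
| C57BL/6J mouse | Lung | 6 | RNA-Seq | Keane | GSE30617 |
| C57BL/6J mouse | Heart | 6 | RNA-Seq | Keane | GSE30617 |
| C57BL/6J mouse | Hippocampus | 6 | RNA-Seq | Keane | GSE30617 |
| C57BL/6J mouse | Striatum | 4 | RNA-Seq | Polymenidou | GSE27218 |
| C57BL/6J mouse | Striatum | 10 | microarray | Bottomly | GSE26024 |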

Variance estimates from these three datasets are used to improve the estimation of the common variance function in the main analysis dataset.

Evaluation study

In this study, we evaluate our proposed method of improving variance estimation for differential gene expression analysis, Tshrink+. This evaluation consists of two components, assessing the capacity of a common variance estimate to explain the observed gene sample variances and evaluating how improving this common variance estimate can aid in the detection of differentially expressed genes. The performance of Tshrink+ will also be compared with two commonly used packages, edgeR and DESeq. This evaluation study is built upon one main dataset, the Bottomly data, and three datasets which are used for additional information.

In order to assess the capacity of a common variance estimate to explain the observed gene sample variances we will use the shrinkage coefficient

We then further demonstrate that improving the information content of an additional information source improves the estimation of the common variance. This will be achieved by using variance estimates from the D2 mice to aid in the estimation of a common variance function of the B6 mice. The variance estimates from a random

We will assess the influence of using additional information and moderation on the detection of differentially expressed (DE) genes. To do this we compare

1. a t-test (T),

2. a moderated t-test (Tshrink) and

3. a moderated t-test using additional information (Tshrink+).

These will also be compared to

4. DESeq using only the common variance (DESeqCommon),

5. DESeq using the maximum of the common variance and sample variance (DESeqMax) and

6. edgeR using a trended common variance and empirical Bayes to shrink the gene sample variances towards the common variance (edgeR).

To assess the effectiveness of the six DE methods, a standard t-test was performed comparing ten B6 and ten D2 mouse striatum samples. In all of the following, the results of this t-test are taken to be the "truth". From this t-test a gene is conservatively called "truly" DE if it has a Bonferroni adjusted p-value of less than 0.05. A gene is called "truly" not DE if it has an unadjusted p-value greater than 0.05. We will then evaluate the ability of the DE methods to recover the information in the comparison of ten B6 samples with ten D2 samples by smaller comparisons of

• generating Receiver Operating Characteristic (ROC) curves, which describe each method's True Positive Rate (TPR) as a function of its False Positive Rate (FPR) over the complete range of p-value cut-offs,

• calculating partial areas under the ROC for FPR less than 0.01 and

• calculating True Positives (TP) and False Positives (FP) using a Bonferroni adjusted p-value cut-off of 0.05.
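The evaluation steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, genes that are neither "truly" DE nor "truly" not DE are excluded from the ROC calculation, and the partial area is computed with the trapezoidal rule.

```python
def truth_labels(p_full, alpha=0.05):
    """Label genes from the full comparison.

    True = "truly" DE (Bonferroni adjusted p < alpha),
    False = "truly" not DE (unadjusted p > alpha),
    None = ambiguous, excluded from the ROC calculation.
    """
    m = len(p_full)
    labels = []
    for p in p_full:
        if min(1.0, p * m) < alpha:
            labels.append(True)
        elif p > alpha:
            labels.append(False)
        else:
            labels.append(None)
    return labels


def roc_points(p_small, labels):
    """Return (FPR, TPR) pairs as the p-value cut-off sweeps upwards.

    Assumes at least one True and one False label are present.
    """
    scored = sorted(
        ((p, lab) for p, lab in zip(p_small, labels) if lab is not None),
        key=lambda t: t[0],
    )
    n_pos = sum(1 for _, lab in scored if lab)
    n_neg = len(scored) - n_pos
    points, tp, fp, i = [(0.0, 0.0)], 0, 0, 0
    while i < len(scored):
        j = i
        # Genes with tied p-values enter at the same cut-off.
        while j < len(scored) and scored[j][0] == scored[i][0]:
            tp += 1 if scored[j][1] else 0
            fp += 0 if scored[j][1] else 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def partial_auc(points, max_fpr=0.01):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate the final segment up to max_fpr.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def tp_fp_at_bonferroni(p_small, labels, alpha=0.05):
    """Count TP and FP among genes passing a Bonferroni adjusted cut-off."""
    cutoff = alpha / len(p_small)
    tp = sum(1 for p, lab in zip(p_small, labels) if lab is True and p < cutoff)
    fp = sum(1 for p, lab in zip(p_small, labels) if lab is False and p < cutoff)
    return tp, fp
```

Here `p_full` stands for the p-values from the ten vs ten t-test and `p_small` for those from one smaller comparison; in the study these quantities are averaged over one hundred random comparisons.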

Results and discussion

The estimation of the common variance

We begin by examining the effect of using information from different additional sources to help explain the variances observed in the Bottomly Data. That is, assessing the impact that each of the additional datasets in Table

The more relevant the information contained in the additional data source, the greater the improvement seen in the common variance estimate. As is perhaps expected either of the two striatum tissue datasets, RNA-Seq and microarray, when used to estimate the common variance produce the largest

Improving the accuracy of the sample variance decreases

Using D2 variance estimates to estimate common variance of four B6 samples.

| **0** | **2** | **3** | **4** | **5** | **6** | **7** | **8** | **9** | **10** |
|---|---|---|---|---|---|---|---|---|---|
| 0.35 | 0.45 | 0.50 | 0.55 | 0.58 | 0.65 | 0.68 | 0.72 | 0.75 | 0.77 |

The average

The impact of moderation on inferring differential expression

The aim of the remainder of the evaluation is to assess how the use of moderation affects inference on differential gene expression. This is done by assessing the impact of moderation on both gene ranking and sensitivity. Moderation is used to both increase the sensitivity of a test, by increasing the degrees of freedom of the variance estimate, and to improve the ranking of a test, by improving the accuracy of the variance estimate.
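To make the role of moderation concrete, the sketch below shrinks a gene-wise variance towards a common variance using a simple convex combination and then forms a two-group test statistic. The weight `lam`, the function names and the numerical values are assumptions for illustration; the quasi-empirical Bayes weights used by Tshrink+ are not reproduced here.

```python
import math


def moderate_variance(gene_var, common_var, lam):
    """Shrink a gene-wise variance towards a common variance.

    lam = 0 keeps the gene-wise estimate; lam = 1 uses the common variance.
    (Hypothetical weight; not the paper's estimator.)
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * common_var + (1.0 - lam) * gene_var


def wald_statistic(mean_a, mean_b, var_a, var_b, n_a, n_b):
    """Two-group statistic using (possibly moderated) variance estimates."""
    return (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)


# A gene whose sample variance is unusually small by chance
# (illustrative values only).
raw_var, common_var = 0.2, 2.0
shrunk_var = moderate_variance(raw_var, common_var, lam=0.5)
t_raw = wald_statistic(5.0, 4.0, raw_var, raw_var, 4, 4)
t_shrunk = wald_statistic(5.0, 4.0, shrunk_var, shrunk_var, 4, 4)
```

In this toy case the unmoderated statistic is inflated by the small sample variance, and shrinking towards the common variance pulls it back. The better the common variance approximates the true gene variance, the more this correction helps ranking.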

We will start by simply comparing the t-test (T), moderated t-test (Tshrink) and a moderated t-test using additional information (Tshrink+). For the additional data source used by Tshrink+, the four striatum RNA-Seq samples

By first considering only four vs four comparisons, the ability of moderation to improve gene ranking is illustrated in Figure

Moderation improves gene ranking, and improving the target that a method moderates towards can improve gene ranking further. This is again illustrated in Figure

**Partial AUCs and the number of True and False Positives for a range of n vs n comparisons**. One hundred random comparisons of

Moderation can improve the sensitivity of a test for differential expression as seen in Figure

Comparison with edgeR and DESeq

Tshrink+ performs favourably when compared to both DESeq and edgeR when considering gene ranking. When assessing gene ranking using Figure

Tshrink+ performs comparably to edgeR and DESeq when assessing sensitivity. T selects a similar number of TP at the cut-off when compared with edgeR but selects fewer FP, as seen in Figures

Conclusions

Using additional information improves the estimation of the common variance and the detection of differentially expressed genes. Our differential expression test, Tshrink+, which incorporates information from additional datasets, showed marked improvement in both gene ranking and sensitivity over a moderated t-test, Tshrink, and a standard t-test, T. Tshrink+ also performed favourably against edgeR and DESeq when comparing gene ranking and comparably when assessing sensitivity.

Whilst Tshrink+ can offer improvements to a differential expression analysis it also provides insight into avenues for further research. The moderation used in Tshrink+

This methodology should be considered a complement to, not a replacement for, meta-analysis when studies similar to the RNA-Seq study of interest exist. Tshrink+ leverages only the variance estimates from external datasets to improve the variance estimation in the study of interest. If information also exists on the changes of expression between conditions, a researcher would be remiss not to utilise it through existing meta-analysis methodologies.

Using external data to improve the estimation of the common variance for a particular problem highlights the significance of access to public data repositories like the Gene Expression Omnibus (GEO)

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

EP developed the method, implemented the algorithm and drafted the manuscript. MB, DL, and YY participated in all aspects of the study and helped to draft the manuscript. All authors read and approved the final manuscript.

Declarations

The publication costs for this article were funded by the corresponding author's institution.

This article has been published as part of

Acknowledgements

We would like to thank Terry Speed, John Robinson, Uri Keich, Samuel Müller and John Ormerod for their insightful comments. This work was supported in part by ARC through grants FT0991918 (YY), Australian Postgraduate Award (EP) and the Alzheimer's Association (DL).