Open Access Research article

GC-Content Normalization for RNA-Seq Data

Davide Risso (1), Katja Schwartz (2), Gavin Sherlock (2) and Sandrine Dudoit (3)*

Author Affiliations

1 Department of Statistical Sciences, University of Padua, Italy

2 Department of Genetics, Stanford University, USA

3 Division of Biostatistics and Department of Statistics, University of California, Berkeley, USA

BMC Bioinformatics 2011, 12:480. doi:10.1186/1471-2105-12-480

Published: 17 December 2011

Abstract

Background

Transcriptome sequencing (RNA-Seq) has become the assay of choice for high-throughput studies of gene expression. However, as is the case with microarrays, major technology-related artifacts and biases affect the resulting expression measures. Normalization is therefore essential to ensure accurate inference of expression levels and subsequent analyses thereof.

Results

We focus on biases related to GC-content and demonstrate the existence of strong sample-specific GC-content effects on RNA-Seq read counts, which can substantially bias differential expression analysis. We propose three simple within-lane gene-level GC-content normalization approaches and assess their performance on two different RNA-Seq datasets, involving different species and experimental designs. Our methods are compared to state-of-the-art normalization procedures in terms of bias and mean squared error for expression fold-change estimation and in terms of Type I error and p-value distributions for tests of differential expression. The exploratory data analysis and normalization methods proposed in this article are implemented in the open-source Bioconductor R package EDASeq.
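The abstract does not spell out the three within-lane approaches, so the following is only a minimal sketch of what a gene-level, within-lane GC-content adjustment could look like. It assumes a simple median-scaling scheme: genes in one lane are grouped into equal-size GC-content bins and each bin is rescaled to a common median. The function name `gc_bin_median_scaling`, its parameters, and the binning scheme are illustrative assumptions, not the EDASeq API or the authors' implementation.

```python
from statistics import median


def gc_bin_median_scaling(counts, gc, n_bins=10):
    """Rescale one lane's gene counts so every GC-content bin shares a median.

    counts: read counts per gene for a single lane.
    gc:     GC-content (fraction between 0 and 1) for the same genes.
    Hypothetical helper for illustration only.
    """
    if len(counts) != len(gc):
        raise ValueError("counts and gc must have the same length")

    n = len(counts)
    # Rank genes by GC-content and split the ranking into equal-size bins.
    order = sorted(range(n), key=lambda i: gc[i])
    bins = [order[k * n // n_bins:(k + 1) * n // n_bins] for k in range(n_bins)]

    # Reference level: median of the positive counts across the whole lane.
    positive = [c for c in counts if c > 0]
    if not positive:
        return [float(c) for c in counts]
    overall = median(positive)

    normalized = [0.0] * n
    for idx in bins:
        in_bin = [counts[i] for i in idx if counts[i] > 0]
        if not in_bin:
            # No expressed genes in this bin: leave its counts unchanged.
            for i in idx:
                normalized[i] = float(counts[i])
            continue
        scale = overall / median(in_bin)
        for i in idx:
            normalized[i] = counts[i] * scale
    return normalized
```

In a toy lane where high-GC genes receive twice the counts of otherwise identical low-GC genes, the two groups end up on the same scale after this adjustment. A between-lane normalization step would still be applied afterwards, as the Conclusions note.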

Conclusions

Our within-lane normalization procedures, followed by between-lane normalization, reduce GC-content bias and lead to more accurate estimates of expression fold-changes and tests of differential expression. Such results are crucial for the biological interpretation of RNA-Seq experiments, where downstream analyses can be sensitive to the supplied lists of genes.