This article is part of the supplement: Proceedings of the Second Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2012)

Open Access Proceedings

MGMR: leveraging RNA-Seq population data to optimize expression estimation

Roye Rozov1, Eran Halperin1,2,3* and Ron Shamir1

Author Affiliations

1 The Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv 69978, Israel

2 Molecular Microbiology and Biotechnology Department, Tel-Aviv University, Tel Aviv 69978, Israel

3 International Computer Science Institute, Berkeley, CA, 94704, USA

BMC Bioinformatics 2012, 13(Suppl 6):S2  doi:10.1186/1471-2105-13-S6-S2

Published: 19 April 2012

Abstract

Background

RNA-Seq is a technique that uses Next Generation Sequencing to identify transcripts and estimate transcription levels. When applying this technique for quantification, one must contend with reads that align to multiple positions in the genome (multireads). Previous efforts to resolve multireads have shown that RNA-Seq expression estimation can be improved using probabilistic allocation of reads to genes. These methods use a probabilistic generative model for data generation and resolve ambiguity using likelihood-based approaches. In many instances, RNA-Seq experiments are performed in the context of a population. The generative models of current methods do not take such population information into account, and it is an open question whether this information can improve quantification of the individual samples.

Results

To explore the contribution of population-level information to RNA-Seq quantification, we apply a hierarchical probabilistic generative model. The model assumes that the expression levels of different individuals are sampled from a Dirichlet distribution with parameters specific to the population, and that reads are sampled from the distribution of expression levels. We introduce an optimization procedure for the estimation of the model parameters, and use HapMap data and simulated data to demonstrate that the model yields a significant improvement in the accuracy of expression levels of paralogous genes.
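The two-level sampling scheme described above can be illustrated with a minimal simulation. This is only a sketch under simplifying assumptions, not the authors' implementation: every read is assigned to exactly one gene (no multireads), read length and gene length are ignored, and the optimization procedure is not shown. The function names and the values in ALPHA are hypothetical and chosen for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one vector from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_population(alpha, n_individuals, reads_per_individual, seed=0):
    """Simulate the hierarchy: population -> individual expression -> reads.

    Returns a list of (theta, counts) pairs, one per individual, where theta
    is the individual's expression distribution over genes and counts is the
    number of simulated reads drawn from each gene.
    """
    rng = random.Random(seed)
    genes = list(range(len(alpha)))
    population = []
    for _ in range(n_individuals):
        # Level 1: the individual's expression levels come from the
        # population-specific Dirichlet distribution.
        theta = sample_dirichlet(alpha, rng)
        # Level 2: each read picks its gene of origin according to theta.
        reads = rng.choices(genes, weights=theta, k=reads_per_individual)
        counts = [0] * len(alpha)
        for gene in reads:
            counts[gene] += 1
        population.append((theta, counts))
    return population


# Hypothetical population parameters for three genes (illustrative only).
ALPHA = [5.0, 2.0, 1.0]
population = simulate_population(ALPHA, n_individuals=4, reads_per_individual=1000)
for theta, counts in population:
    print([round(t, 3) for t in theta], counts)
```

Because all individuals share ALPHA, their expression vectors cluster around a common population mean. That shared structure is the kind of information a population-aware model can draw on when individual reads are ambiguous.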

Conclusions

We provide a proof of principle of the benefit of drawing on population commonalities to estimate expression. The results of our experiments demonstrate that this approach can be beneficial, primarily for estimation at the gene level.