Open Access Highly Accessed Methodology article

Improved base-calling and quality scores for 454 sequencing based on a Hurdle Poisson model

Kristof De Beuf1*, Joachim De Schrijver1, Olivier Thas1,2, Wim Van Criekinge1, Rafael A Irizarry3 and Lieven Clement4,5*

Author Affiliations

1 Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, Coupure Links 653, B9000 Ghent, Belgium

2 Centre for Statistical and Survey Methodology, School of Mathematics and Applied Statistics, University of Wollongong, NSW 2522, Australia

3 Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe St., Baltimore, MD, USA

4 Department of Applied Mathematics and Computer Science, Ghent University, Krijgslaan 281-S9, B9000 Ghent, Belgium

5 Interuniversity Institute for Biostatistics and Statistical Bioinformatics, Katholieke Universiteit Leuven and Universiteit Hasselt, Kapucijnenvoer 35, Blok D, bus 7001, B3000 Leuven, Belgium

BMC Bioinformatics 2012, 13:303  doi:10.1186/1471-2105-13-303

Published: 15 November 2012

Abstract

Background

454 pyrosequencing is a commonly used massively parallel DNA sequencing technology with applications in fields such as epigenetics, metagenomics and transcriptomics. A well-known problem of this platform is its susceptibility to insertion and deletion base-calling errors, particularly in the presence of long homopolymers. In addition, the base-call quality scores do not indicate whether an insertion or a deletion error is more likely. Surprisingly, little effort has been devoted to developing improved base-calling methods and more intuitive quality scores for this platform.

Results

We present HPCall, a 454 base-calling method based on a weighted Hurdle Poisson model. HPCall uses a probabilistic framework to call the homopolymer lengths in the sequence by modeling well-known 454 noise predictors. Base-calling quality is assessed based on estimated probabilities for each homopolymer length, which are easily transformed to useful quality scores.
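
To make the idea concrete, the following is a minimal sketch of how a hurdle Poisson distribution can yield homopolymer-length probabilities and a quality score. It is not the HPCall implementation: the function names, the parameters `p_zero` and `lam`, the truncation at `max_len`, and the Phred-style transformation are all assumptions made for illustration. In HPCall the model parameters would depend on the 454 noise predictors mentioned above, whereas here they are fixed numbers.

```python
import math

def hurdle_poisson_pmf(k, p_zero, lam):
    """P(length = k) under a hurdle Poisson model (illustrative only).

    p_zero : probability that the homopolymer length is 0 (the hurdle part)
    lam    : rate of the zero-truncated Poisson used for lengths >= 1
    """
    if k < 0:
        raise ValueError("homopolymer length must be non-negative")
    if k == 0:
        return p_zero
    # Poisson log-probability, computed in log space for stability
    log_pois = -lam + k * math.log(lam) - math.lgamma(k + 1)
    # Rescale by 1 - P(Poisson = 0) so that lengths >= 1 sum to 1 - p_zero
    return (1.0 - p_zero) * math.exp(log_pois) / (1.0 - math.exp(-lam))

def call_homopolymer(p_zero, lam, max_len=15):
    """Return (call, quality, p_overcall, p_undercall) for one flow.

    The Phred-style score below is an assumed transformation, not
    necessarily the one used by HPCall.
    """
    probs = [hurdle_poisson_pmf(k, p_zero, lam) for k in range(max_len + 1)]
    total = sum(probs)
    probs = [p / total for p in probs]  # renormalise after truncation
    call = max(range(len(probs)), key=probs.__getitem__)
    p_error = 1.0 - probs[call]
    quality = -10.0 * math.log10(max(p_error, 1e-10))
    p_overcall = sum(probs[:call])       # true length shorter than the call
    p_undercall = sum(probs[call + 1:])  # true length longer than the call
    return call, quality, p_overcall, p_undercall

if __name__ == "__main__":
    call, q, p_over, p_under = call_homopolymer(p_zero=0.05, lam=3.2)
    print(f"call={call} Q={q:.1f} P(overcall)={p_over:.3f} P(undercall)={p_under:.3f}")
```

Splitting the error probability into an overcall part and an undercall part is what makes such a score informative about the direction of a likely error, which a single quality value cannot convey.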

Conclusions

Using a reference data set of the Escherichia coli K-12 strain, we show that HPCall produces superior quality scores that are highly informative about possible insertion and deletion errors, while achieving a base-calling accuracy that is better than that of the current base-caller. Given the generality of the framework, HPCall has the potential to be adapted to other homopolymer-sensitive sequencing technologies.