Methodology article

Improved base-calling and quality scores for 454 sequencing based on a Hurdle Poisson model

Kristof De Beuf1*, Joachim De Schrijver1, Olivier Thas1,2, Wim Van Criekinge1, Rafael A Irizarry3 and Lieven Clement4,5*

Author Affiliations

1 Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, Coupure Links 653, B9000 Ghent, Belgium

2 Centre for Statistical and Survey Methodology, School of Mathematics and Applied Statistics, University of Wollongong, NSW 2522, Australia

3 Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe St., Baltimore, MD, USA

4 Department of Applied Mathematics and Computer Science, Ghent University, Krijgslaan 281-S9, B9000 Ghent, Belgium

5 Interuniversity Institute for Biostatistics and Statistical Bioinformatics, Katholieke Universiteit Leuven and Universiteit Hasselt, Kapucijnenvoer 35, Blok D, bus 7001, B3000 Leuven, Belgium

BMC Bioinformatics 2012, 13:303  doi:10.1186/1471-2105-13-303

Published: 15 November 2012

Additional files

Additional file 1:

Probability of miscalls by native 454 base-caller. Probability of miscalls by the native 454 base-caller for different HPLs. The base-calling error rate clearly increases with increasing HPL and becomes quite substantial from HPL 4 onwards.

Format: PDF Size: 24KB

Additional file 2:

Overview of HPCall base-calling pipeline. Overview of the HPCall base-calling pipeline. Different source files are merged in a data preparation step before the base-calling takes place. The output of the pipeline contains base-called sequence reads, Phred-like quality scores, and base-calling probabilities for the different HPLs.
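
The relationship between the per-HPL base-calling probabilities and a Phred-like quality score in the pipeline output can be illustrated with a minimal sketch. It assumes the generic Phred convention Q = -10 log10(P(error)), with the error probability taken as one minus the probability of the called HPL. This is not necessarily the exact definition used by HPCall, and the function name, input layout and score cap are hypothetical.

```python
import math


def phred_from_probs(hpl_probs, cap=60.0):
    """Call the most probable HPL and attach a Phred-like quality score.

    hpl_probs: dict mapping a homopolymer length (HPL) to its estimated
    probability (hypothetical input layout, for illustration only).
    cap: upper bound on the score, used when the error probability is zero.
    """
    # The called HPL is the one with the highest estimated probability.
    called_hpl = max(hpl_probs, key=hpl_probs.get)

    # Generic Phred convention (an assumption, not HPCall's own definition):
    # the error probability is everything not assigned to the called HPL.
    p_error = 1.0 - hpl_probs[called_hpl]
    if p_error <= 0.0:
        return called_hpl, cap

    quality = min(cap, -10.0 * math.log10(p_error))
    return called_hpl, quality


# Example: probabilities for HPL 2, 3 and 4 at one flow.
call, quality = phred_from_probs({2: 0.009, 3: 0.99, 4: 0.001})
print(call, round(quality, 1))
```

Under this convention, an error probability of 0.01 corresponds to a score of about 20 and an error probability of 0.001 to a score of about 30.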

Format: PDF Size: 225KB

Additional file 3:

Smooth functions in Hurdle Poisson model. Specification of the smooth functions fj and gj in the Hurdle Poisson model.

Format: PDF Size: 28KB

Additional file 4:

Distribution of quality scores at HPLref 3 undercall. The empirical cumulative distribution function of 454 quality scores (upper) and HPCall quality scores QSundercall (lower) for sequences with reference HPLref 3 assigned to bases associated with HPL 2, 3 and 4. Because of the undercall, only 454 quality scores associated with HPL 2 are available. The HPCall quality scores associated with HPL 3 and HPL 4 are mostly very high, whereas those associated with HPL 2 are not. HPCall clearly indicates that undercalls are more likely in this situation, an insight that the 454 quality scores do not provide.

Format: PDF Size: 24KB

Additional file 5:

Distribution of new informative quality scores at HPLref 3. The empirical cumulative distribution function of HPCall quality scores QSHPCall for sequences with reference HPLref 3 assigned to bases associated with HPL 2, 3 and 4, in the case of an undercall (upper), correct call (middle) or overcall (lower).

Format: PDF Size: 34KB

Additional file 6:

Histograms of estimated probabilities by HPCall. (A) Histogram of the maximal estimated probabilities by HPCall in the case of a correct call (upper left) and (B) in the case of a miscall (upper right). (C) The histogram in the lower left panel gives the distribution of estimated probabilities for the reference HPLs in the case of a miscall. These very often correspond with the reference HPL. (D) The lower right panel gives the histogram of the sum of the probabilities given in the upper right and lower left panels. These two probabilities almost always sum to a value close to 1.

Format: PDF Size: 21KB

Additional file 7:

Comparison of absolute number of base-calling errors. Comparison of the absolute number of base-calling errors by HPL for the three base-calling methods. Using HPCall leads to an overall 35% decrease in the number of base-calling errors compared to the native 454 base-caller. The lower number of base-calling errors for HPCall is consistent across the complete range of reference HPLs.

Format: PDF Size: 30KB

Additional file 8:

Comparison of base-calling prediction accuracy. Prediction accuracy for the different base-calling methods separated by nucleotide type. Although the prediction accuracy of the native base-caller is already quite high, HPCall obtains higher prediction accuracies for each individual nucleotide type. This is still the case if only flowgram values (fg) are used. Both HPCall and the native 454 base-caller clearly outperform Pyrobayes.

Format: PDF Size: 25KB

Additional file 9:

Variability of the prediction accuracy by HPCall. Variability of the prediction accuracy of HPCall. The obtained prediction accuracies are very stable among the different random samples of training data. The standard deviations of the prediction accuracies range from 0.000024 (for nucleotide C) to 0.000047 (for nucleotide T).

Format: PDF Size: 25KB

Additional file 10:

Comparison of mapping mismatches. Percentage of reads with different numbers of mismatches in the mapping between the reads produced by either HPCall or the native 454 base-caller and the E. coli K-12 reference sequence. Either ssaha2 or subread is used for mapping. The number of sequence variants detected for the E. coli data set using ssahaSNP is also given. HPCall results in more perfect-matching reads and fewer indels and SNPs overall.

Format: PDF Size: 18KB

Additional file 11:

Base-calling of human TP53 454 amplicon resequencing data. Percentage of reads with different numbers of mismatches in the mapping between the reads produced by either HPCall (with or without training) or the native 454 base-caller and the human TP53 gene reference sequence. HPCall results in more perfect-matching reads. When trained on the E. coli data set, the percentage of perfect-matching reads decreases slightly.

Format: PDF Size: 18KB

Additional file 12:

Base-calling of PGM 314 E. coli data with HPCall. Cumulative percentage of reads as a function of mismatches per read in the mapping between the reads produced by either HPCall or the standard Ion PGM base-caller and the E. coli DH10B reference sequence. The results for HPCall are promising.

Format: PDF Size: 28KB