Methodology article (Open Access)

Sequencing error correction without a reference genome

Julie A Sleep1,2*, Andreas W Schreiber3,4 and Ute Baumann1

Author Affiliations

1 Australian Centre for Plant Functional Genomics, The University of Adelaide, Urrbrae, SA 5064, Australia

2 Phenomics and Bioinformatics Research Centre, University of South Australia, Mawson Lakes, SA 5095, Australia

3 ACRF South Australian Cancer Genome Facility, Centre for Cancer Biology, SA Pathology, Adelaide, SA 5000, Australia

4 School of Molecular and Biomedical Science, University of Adelaide, Adelaide, SA 5000, Australia

BMC Bioinformatics 2013, 14:367  doi:10.1186/1471-2105-14-367

Published: 18 December 2013

Abstract

Background

Next (second) generation sequencing is an increasingly important tool for many areas of molecular biology; however, care must be taken when interpreting its output. Even a low error rate can cause a large number of errors due to the high number of nucleotides being sequenced. Distinguishing sequencing errors from true biological variants is a challenging task. For organisms without a reference genome, this task is even more difficult.

Results

We have developed a method for the correction of sequencing errors in data from the Illumina Solexa sequencing platforms. It does not require a reference genome and is of relevance for microRNA studies, unsequenced genomes, variant detection in ultra-deep sequencing and even for RNA-Seq studies of organisms with sequenced genomes where RNA editing is being considered.

Conclusions

The derived error model is novel in that it allows different error probabilities for each position along the read, in conjunction with different error rates depending on the particular nucleotides involved in the substitution, and does not force these effects to behave in a multiplicative manner. The model provides error rates which capture the complex effects and interactions of the three main known causes of sequencing error associated with the Illumina platforms.
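
As an illustration only, the difference between a rate table with one entry per (position, substitution) pair and a table constrained to be multiplicative can be sketched in Python. This is not the authors' estimation procedure: it assumes a toy set of (true read, observed read) pairs is available, which a reference-free method by definition does not have, and every function name and data value here is hypothetical.

```python
from collections import defaultdict

BASES = "ACGT"

def tabulate(pairs, read_length):
    """Count (true_base, observed_base) events at each read position.

    ASSUMPTION: true reads are known. This holds only for toy or simulated data.
    """
    counts = [defaultdict(int) for _ in range(read_length)]
    for true_read, observed_read in pairs:
        for pos, (t, o) in enumerate(zip(true_read, observed_read)):
            counts[pos][(t, o)] += 1
    return counts

def joint_rate(counts, pos, t, o):
    """Rate of reading true base t as o, estimated separately for each position."""
    total = sum(counts[pos].get((t, b), 0) for b in BASES)
    return counts[pos].get((t, o), 0) / total if total else 0.0

def position_rate(counts, pos):
    """Fraction of bases at this position that were read incorrectly."""
    total = sum(counts[pos].values())
    errors = sum(n for (t, o), n in counts[pos].items() if t != o)
    return errors / total if total else 0.0

def substitution_rate(counts, t, o):
    """Fraction of true-t bases read as o, pooled over all positions."""
    total = sum(c.get((t, b), 0) for c in counts for b in BASES)
    hits = sum(c.get((t, o), 0) for c in counts)
    return hits / total if total else 0.0

def overall_rate(counts):
    """Fraction of all bases read incorrectly."""
    total = sum(sum(c.values()) for c in counts)
    errors = sum(n for c in counts for (t, o), n in c.items() if t != o)
    return errors / total if total else 0.0

def multiplicative_rate(counts, pos, t, o):
    """Rate if the position effect and the substitution effect simply multiply."""
    overall = overall_rate(counts)
    if overall == 0.0:
        return 0.0
    return position_rate(counts, pos) * substitution_rate(counts, t, o) / overall

# Toy data (hypothetical): A->G errors occur only at position 0,
# A->T errors occur only at position 1.
toy_pairs = (
    [("AC", "AC")] * 8 + [("AC", "GC")] * 2
    + [("CA", "CA")] * 9 + [("CA", "CT")] * 1
)
counts = tabulate(toy_pairs, read_length=2)

for pos in range(2):
    print(pos, joint_rate(counts, pos, "A", "G"),
          multiplicative_rate(counts, pos, "A", "G"))
```

In this toy data the per-cell estimate for A to G at position 1 is zero because no such error was observed there, whereas the multiplicative form assigns it a nonzero rate. That kind of position-by-substitution interaction is what a non-multiplicative table is able to represent.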