BMC Bioinformatics


This article is part of the supplement: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011)

Open Access Research

Repeat-aware modeling and correction of short read errors

Xiao Yang1, Srinivas Aluru1,2 and Karin S Dorman3*

Author Affiliations

1 Department of Electrical and Computer Engineering, Iowa State University, Ames, Iowa, 50011, USA

2 Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai, Maharashtra, 400 076, India

3 Department of Statistics and Department of Genetics, Development & Cell Biology, Iowa State University, Ames, Iowa, 50011, USA


BMC Bioinformatics 2011, 12(Suppl 1):S52 doi:10.1186/1471-2105-12-S1-S52

Published: 15 February 2011

Abstract

Background

High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications, including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short read error detection is typically carried out by counting the observed frequencies of kmers in reads and validating those whose frequencies exceed a threshold. In genomes with high repeat content, an erroneous kmer may be observed frequently if it differs by only a few nucleotides from valid kmers that occur multiple times in the genome. Error detection and correction have mostly been applied to genomes with low repeat content, and remain challenging for genomes with high repeat content.
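The conventional counting approach described above can be illustrated with a minimal Python sketch. It is not the authors' method or software. The function names (`count_kmers`, `split_by_threshold`, `observed_neighbors`), the toy reads, and the threshold value are all assumptions made for illustration. The last function hints at why repeats are troublesome: an erroneous kmer that sits one substitution away from a frequently occurring kmer can itself accumulate a high count.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each kmer (substring of length k) is observed in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_threshold(counts, threshold):
    """Treat kmers whose observed frequency exceeds the threshold as valid.

    Returns (valid, suspect) as two sets of kmers.
    """
    valid = {kmer for kmer, c in counts.items() if c > threshold}
    suspect = set(counts) - valid
    return valid, suspect


def observed_neighbors(kmer, counts, alphabet="ACGT"):
    """Return observed kmers that differ from `kmer` by exactly one substitution.

    Illustrative helper (hypothetical): a suspect kmer with a high-count
    neighbor may be a misread of that neighbor.
    """
    found = {}
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt == base:
                continue
            candidate = kmer[:i] + alt + kmer[i + 1:]
            if candidate in counts:
                found[candidate] = counts[candidate]
    return found


if __name__ == "__main__":
    # Toy data: five identical reads plus one read with a single substitution.
    reads = ["ACGTAC"] * 5 + ["ACGTTC"]
    counts = count_kmers(reads, k=4)
    valid, suspect = split_by_threshold(counts, threshold=2)
    for kmer in sorted(suspect):
        print(kmer, "may be a misread of", observed_neighbors(kmer, counts))
```

In this toy input the two low-count kmers each have a single high-count neighbor, so a fixed threshold separates them cleanly. In a repeat-rich genome the same erroneous kmer could be generated from many copies of its neighbor and pass the threshold, which is the situation the abstract identifies as difficult.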

Results

We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer the genomic frequencies of kmers from their observed frequencies by analyzing the misread relationships among observed kmers. We also propose a method to estimate the threshold used for validation: a kmer is validated when its estimated genomic frequency exceeds the threshold. We demonstrate that these methods achieve superior error detection. Furthermore, we break away from the common assumption that errors are uniformly distributed within a read, and provide a framework to model the position-dependent error frequencies common to many short read platforms. Lastly, we achieve better error correction in genomes with high repeat content. Availability: The software is implemented in C++ and is freely available under the GNU GPL3 license and the Boost Software License V1.0 at “http://aluru-sun.ece.iastate.edu/doku.php?id=redeem”.

Conclusions

We introduce a statistical framework to model sequencing errors in next-generation reads, which yields promising results in detecting and correcting errors for genomes with high repeat content.