Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

Open Access Highly Accessed Research article

Parallelized short read assembly of large genomes using de Bruijn graphs

Yongchao Liu1*, Bertil Schmidt2* and Douglas L Maskell1

Author Affiliations

1 School of Computer Engineering, Nanyang Technological University, Singapore

2 Institut für Informatik, Johannes Gutenberg University Mainz, Germany

For all author emails, please log on.

BMC Bioinformatics 2011, 12:354  doi:10.1186/1471-2105-12-354

Published: 25 August 2011

Abstract

Background

Next-generation sequencing technologies have given rise to the explosive increase in DNA sequencing throughput, and have promoted the recent development of de novo short read assemblers. However, existing assemblers require high execution times and a large amount of compute resources to assemble large genomes from quantities of short reads.

Results

We present PASHA, a parallelized short read assembler using de Bruijn graphs, which takes advantage of hybrid computing architectures consisting of both shared-memory multi-core CPUs and distributed-memory compute clusters to gain efficiency and scalability. Evaluation using three small-scale real paired-end datasets shows that PASHA is able to produce more contiguous high-quality assemblies in shorter time compared to three leading assemblers: Velvet, ABySS and SOAPdenovo. PASHA's scalability for large genome datasets is demonstrated with human genome assembly. Compared to ABySS, PASHA achieves competitive assembly quality with faster execution speed on the same compute resources, yielding an NG50 contig size of 503 with the longest correct contig size of 18,252, and an NG50 scaffold size of 2,294. Moreover, the human assembly is completed in about 21 hours with only modest compute resources.

Conclusions

Developing parallel assemblers for large genomes has been garnering significant research efforts due to the explosive size growth of high-throughput short read datasets. By employing hybrid parallelism consisting of multi-threading on multi-core CPUs and message passing on compute clusters, PASHA is able to assemble the human genome with high quality and in reasonable time using modest compute resources.