Research article

Assessing pooled BAC and whole genome shotgun strategies for assembly of complex genomes

Niina Haiminen1*, F Alex Feltus2,3 and Laxmi Parida1

Author Affiliations

1 IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, USA

2 Department of Genetics & Biochemistry, Clemson University, Clemson, SC 29634, USA

3 Clemson University Genomics Institute, Clemson University, Clemson, SC 29634, USA

BMC Genomics 2011, 12:194  doi:10.1186/1471-2164-12-194

Published: 15 April 2011

Abstract

Background

We investigate whether pooling BAC clones and sequencing the pools can yield more accurate assembly of genome sequences than the "whole genome shotgun" (WGS) approach, and we quantify the increase in accuracy. We compare the pooled BAC and WGS approaches using in silico simulations. Standard measures of assembly quality focus on assembly size and fragmentation, which are informative for large whole genome assemblies. We propose additional measures, such as rearrangements and redundant sequence content relative to the known target sequence, that enable easy, visual comparison of assembly quality.

Results

The best assembly quality scores were obtained using 454 coverage of 15× linear and 5× paired (3 kb insert size) reads (15L-5P) on Arabidopsis. This regime gave similarly good results on four additional plant genomes of very different GC and repeat contents. BAC pooling improved assembly scores over WGS assembly, with coverage and redundancy scores improving the most.

Conclusions

BAC pooling works better than WGS; however, both require a physical map to order the scaffolds. Pool sizes up to 12 Mbp work well, suggesting that this pooling density is effective in medium-scale re-sequencing applications such as targeted sequencing of QTL intervals for candidate gene discovery. Assuming the current Roche/454 Titanium sequencing limitations, a 12 Mbp region could be re-sequenced with a full plate of linear reads and a half plate of paired-end reads, yielding 15L-5P coverage after read pre-processing. Our simulations suggest that massive over-sequencing may not improve accuracy. Our scoring measures can be used generally to evaluate and compare results of simulated genome assemblies.
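
The 15L-5P figures for a 12 Mbp region can be checked with the usual coverage relation, coverage = (number of reads × read length) / region length. The following is a minimal sketch of that arithmetic only, not the authors' simulation. It assumes that "5× paired" refers to sequence coverage contributed by paired reads. The function names and the 400 bp read length are hypothetical values chosen for illustration and are not taken from the article.

```python
import math


def bases_needed(region_bp: int, coverage: float) -> int:
    """Total sequenced bases required to reach an average coverage."""
    return round(region_bp * coverage)


def reads_needed(region_bp: int, coverage: float, read_len_bp: int) -> int:
    """Number of reads of a given length required to reach an average coverage."""
    return math.ceil(region_bp * coverage / read_len_bp)


REGION_BP = 12_000_000      # 12 Mbp pool size, from the abstract
LINEAR_COVERAGE = 15        # "15L", from the abstract
PAIRED_COVERAGE = 5         # "5P", from the abstract
ASSUMED_READ_LEN_BP = 400   # hypothetical read length, NOT from the article

linear_bases = bases_needed(REGION_BP, LINEAR_COVERAGE)
paired_bases = bases_needed(REGION_BP, PAIRED_COVERAGE)
linear_reads = reads_needed(REGION_BP, LINEAR_COVERAGE, ASSUMED_READ_LEN_BP)
paired_reads = reads_needed(REGION_BP, PAIRED_COVERAGE, ASSUMED_READ_LEN_BP)

print(f"linear: {linear_bases:,} bases, {linear_reads:,} reads")
print(f"paired: {paired_bases:,} bases, {paired_reads:,} reads")
```

Under these assumptions the linear reads must supply 180 Mbp of sequence and the paired reads 60 Mbp after pre-processing. The read counts scale inversely with whatever read length is actually achieved.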