Efficient de novo assembly of large and complex genomes by massively parallel sequencing of Fosmid pools
1 School of Biotechnology, Science for Life Laboratory, KTH Royal Institute of Technology, Box 1031, 171 21 Solna, Sweden
2 Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 171 21 Solna, Sweden
3 Lucigen Corporation, 2120 W Greenview Dr., Suite 9, Middleton, WI 53562, USA
4 CLC bio A/S, Silkeborgvej 2, 8000 Aarhus C, Denmark
5 BACPAC Resources, Children’s Hospital of Oakland Research Institute, Bruce Lyon Memorial Research Building, Oakland, California 94609, USA
6 Present Addresses: Department of Microbiology, Bioinformatics Infrastructure for Life Sciences, Science for Life Laboratory, Tumour and Cell Biology, Karolinska Institutet, 17177 Stockholm, Sweden
7 Present Addresses: Department of Cell and Molecular Biology, SciLifeLab, WABI, Uppsala University, Uppsala, Sweden
8 Present Addresses: Intact Genomics Inc, 1100 Corporate Square Drive, Suite 257, St. Louis, Missouri, USA
BMC Genomics 2014, 15:439 doi:10.1186/1471-2164-15-439
Published: 6 June 2014
Sampling genomes with Fosmid vectors and sequencing pooled Fosmid libraries on the Illumina platform for massively parallel sequencing is a novel and promising approach to optimizing the trade-off between sequencing costs and assembly quality.
To sequence the genome of Norway spruce, which is of great size and complexity, we developed and applied a new technology based on the massive production, sequencing, and assembly of Fosmid pools (FP). The spruce chromosomes were sampled with ~40,000 bp Fosmid inserts to obtain around two-fold genome coverage, in parallel with traditional whole genome shotgun sequencing (WGS) of haploid and diploid genomes. Compared to the WGS results, the contiguity and quality of the FP assemblies were high, and they allowed us to fill WGS gaps resulting from repeats, low coverage, and allelic differences. The FP contig sets were further merged with WGS data using a novel software package, GAM-NGS.
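To give a sense of the scale implied by two-fold coverage with ~40,000 bp inserts, the arithmetic can be sketched as follows. This is a minimal illustration, not part of the published pipeline. The genome size of roughly 20 Gbp is an assumption that is not stated in this text, and the function names are hypothetical. The covered-fraction estimate further assumes that inserts sample the genome randomly and uniformly (a Lander-Waterman style Poisson model), which real Fosmid libraries only approximate.

```python
import math


def clones_for_coverage(genome_size_bp, insert_size_bp, coverage):
    """Number of clones whose summed insert length equals coverage x genome size."""
    return math.ceil(coverage * genome_size_bp / insert_size_bp)


def expected_fraction_covered(coverage):
    """Expected fraction of bases hit by at least one insert under a Poisson model."""
    return 1.0 - math.exp(-coverage)


# Assumption: Norway spruce genome of roughly 20 Gbp (not given in this text).
GENOME_SIZE_BP = 20_000_000_000
INSERT_SIZE_BP = 40_000   # ~40,000 bp Fosmid inserts, as stated above
COVERAGE = 2.0            # around two-fold genome coverage, as stated above

n_clones = clones_for_coverage(GENOME_SIZE_BP, INSERT_SIZE_BP, COVERAGE)
fraction = expected_fraction_covered(COVERAGE)

print(f"Fosmid clones needed: {n_clones:,}")
print(f"Expected fraction of genome sampled: {fraction:.3f}")
```

Under these assumptions, two-fold coverage corresponds to on the order of a million Fosmid clones and still leaves part of the genome unsampled, which is consistent with running FP sequencing in parallel with WGS rather than in place of it.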
By exploiting FP technology, the first published assembly of a conifer genome was sequenced entirely with massively parallel sequencing. Here we provide a comprehensive report on the different features of the approach and the optimization of the process.
We have made public the input data (FASTQ format) for the set of pools used in this study:
The software used for running the assembly process is available at http://research.scilifelab.se/andrej_alexeyenko/downloads/fpools/.