Open Access | Highly Accessed | Software

GBParsy: A GenBank flatfile parser library with high speed

Tae-Ho Lee1,2, Yeon-Ki Kim2 and Baek Hie Nahm1,2

1Division of Bioscience and Bioinformatics, MyongJi University, Yongin, Kyonggido, Republic of Korea

2Genomics Genetics Institute, GreenGene BioTech Inc., Yongin, Kyonggido, Republic of Korea


BMC Bioinformatics 2008, 9:321 doi:10.1186/1471-2105-9-321

Published: 25 July 2008

Abstract

Background

The GenBank flatfile (GBF) format is one of the most popular sequence file formats because of its detailed sequence features and human readability. For a computer to use the data, the file must be parsed according to the grammar that governs the sequence and its description in a GBF. Several parser libraries for the GBF have already been developed. However, as DNA sequence information from eukaryotic chromosomes accumulates, parsing a eukaryotic genome sequence with these libraries inevitably takes a long time because of the large GBF file and its correspondingly large genomic nucleotide sequence and related feature information. Thus, there is a significant need for a parser that is fast and uses system memory efficiently.

Results

We developed GBParsy, a library written in C that parses GBF files. Parsing speed was maximized by using content-specific functions in place of regular expressions, which are flexible but slow. In addition, we optimized an algorithm related to memory usage, which improved both parsing performance and the efficiency of memory usage. In benchmark tests, GBParsy was 5 to 100 times faster than current parsers.
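To illustrate the general idea of replacing regular expressions with content-specific functions, the following is a minimal sketch in C. It is not taken from GBParsy and does not reflect its API: the names `feature_t`, `parse_span` and `parse_feature_line` are hypothetical. It relies only on the fixed-column layout of the GenBank feature table (feature key starting in column 6, location starting in column 22) and handles only simple locations of the form `start..end`, optionally wrapped in `complement(...)`.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define KEY_COL 5   /* feature key starts at column 6 (0-based index 5)  */
#define LOC_COL 21  /* location starts at column 22 (0-based index 21)   */
#define KEY_MAX 16  /* key field is 15 characters wide, plus terminator  */

/* Hypothetical record for one feature; not GBParsy's data structure. */
typedef struct {
    char key[KEY_MAX];
    long start;
    long end;
    int  complement;
} feature_t;

/* Parse "123..456" with strtol instead of a regular expression.
 * Returns 1 on success, 0 if the text is not a simple span. */
static int parse_span(const char *s, long *start, long *end)
{
    char *p;

    if (!isdigit((unsigned char)*s))
        return 0;
    *start = strtol(s, &p, 10);
    if (p[0] != '.' || p[1] != '.')
        return 0;
    p += 2;
    if (!isdigit((unsigned char)*p))
        return 0;
    *end = strtol(p, &p, 10);
    return 1;
}

/* Parse one feature-table line by reading fixed columns directly.
 * Returns 1 if the line starts a feature with a simple location,
 * 0 for qualifier/continuation lines or unsupported locations. */
int parse_feature_line(const char *line, feature_t *out)
{
    size_t n = 0;
    const char *loc;

    assert(line != NULL && out != NULL);
    if (strlen(line) <= LOC_COL)
        return 0;               /* too short to hold a location */
    if (line[KEY_COL] == ' ')
        return 0;               /* no key in column 6: not a feature start */

    /* Copy the key; indices stay below LOC_COL, so they are in bounds. */
    while (n < KEY_MAX - 1 && line[KEY_COL + n] != ' ') {
        out->key[n] = line[KEY_COL + n];
        n++;
    }
    out->key[n] = '\0';

    loc = line + LOC_COL;
    out->complement = 0;
    if (strncmp(loc, "complement(", 11) == 0) {
        out->complement = 1;
        loc += 11;
    }
    return parse_span(loc, &out->start, &out->end);
}
```

Each field is read with a single pass over known columns, so no pattern has to be compiled or backtracked. A caller would pass each line of the FEATURES block to `parse_feature_line` and skip lines for which it returns 0; compound locations such as `join(...)` would need additional handling that this sketch omits.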

Conclusion

GBParsy is estimated to extract annotated information from almost 100 Mb of a GenBank flatfile of chromosomal sequence information within a second. Thus, it should be useful for a variety of applications, such as real-time visualization of a genome on a web site.
