Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

Open Access Highly Accessed Software

Processing genome scale tabular data with wormtable

Jerome Kelleher*, Rob W Ness and Daniel L Halligan

Author Affiliations

University of Edinburgh, King’s Buildings, West Mains Road, Edinburgh, EH9 3JT, UK

For all author emails, please log on.

BMC Bioinformatics 2013, 14:356  doi:10.1186/1471-2105-14-356

Published: 5 December 2013

Abstract

Background

Modern biological science generates a vast amount of data, the analysis of which presents a major challenge to researchers. Data are commonly represented in tables stored as plain text files and require line-by-line parsing for analysis, which is time consuming and error prone. Furthermore, there is no simple means of indexing these files so that rows containing particular values can be quickly found.

Results

We introduce a new data format and software library called wormtable, which provides efficient access to tabular data in Python. Wormtable stores data in a compact binary format, provides random access to rows, and enables sophisticated indexing on columns within these tables. Files written in existing formats can be easily converted to wormtable format, and we provide conversion utilities for the VCF and GTF formats.

Conclusions

Wormtable’s simple API allows users to process large tables orders of magnitude more quickly than is possible when parsing text. Furthermore, the indexing facilities provide efficient access to subsets of the data along with providing useful methods of summarising columns. Since third-party libraries or custom code are no longer needed to parse complex plain text formats, analysis code can also be substantially simpler as well as being uniform across different data formats. These benefits of reduced code complexity and greatly increased performance allow users much greater freedom to explore their data.