Open Access Database

The M5nr: a novel non-redundant database containing protein sequences and annotations from multiple sources and associated tools

Andreas Wilke1,2, Travis Harrison1,2, Jared Wilkening1,5, Dawn Field3, Elizabeth M Glass1,2, Nikos Kyrpides4, Konstantinos Mavrommatis4 and Folker Meyer1,2,5*

Author Affiliations

1 Mathematics and Computer Science Division, Argonne National Laboratory, 9700 S. Cass Ave., Argonne, IL, 60439, USA

2 Computation Institute, University of Chicago, 5735 South Ellis Avenue, Chicago, IL, 60637, USA

3 Centre for Ecology & Hydrology, Maclean Building, Crowmarsh Gifford, Wallingford, Oxfordshire, United Kingdom

4 Department of Energy Joint Genome Institute, Walnut Creek, CA, USA

5 Institute for Genomics and Systems Biology, 900 East 57th Street, Chicago, IL, 60637, USA

BMC Bioinformatics 2012, 13:141  doi:10.1186/1471-2105-13-141

Published: 21 June 2012

Abstract

Background

Computing sequence similarity results is becoming a limiting factor in metagenome analysis. Sequence similarity search results encoded in an open, exchangeable format have the potential to reduce the need for computational reanalysis of these data sets. A prerequisite for sharing similarity results is a common reference.

Description

We introduce a mechanism for automatically maintaining a comprehensive, non-redundant protein database and for creating a quarterly release of this resource. In addition, we present tools for translating similarity searches into many annotation namespaces, e.g. KEGG or NCBI's GenBank.

Conclusions

The data and tools we present allow the creation of multiple result sets using a single computation, permitting computational results to be shared between groups for large sequence data sets.
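
The idea summarized above, one similarity computation reused across several annotation namespaces, can be illustrated with a minimal sketch. It assumes that identical sequences are collapsed under a checksum key (MD5 is used here as an assumption; this abstract does not specify the key), and that each key keeps the annotations from every source. The function names, record layout, identifiers and sequences are all hypothetical and are not taken from the M5nr tools.

```python
import hashlib
from collections import defaultdict


def sequence_key(sequence):
    """Return the MD5 hex digest of a whitespace-stripped, upper-cased sequence."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def build_nr(records):
    """Collapse identical sequences into one entry per checksum.

    records: iterable of (source, identifier, annotation, sequence) tuples.
    Returns (sequences, annotations): one stored sequence per key, and the
    list of (source, identifier, annotation) tuples attached to each key.
    """
    sequences = {}
    annotations = defaultdict(list)
    for source, identifier, annotation, sequence in records:
        key = sequence_key(sequence)
        sequences.setdefault(key, "".join(sequence.split()).upper())
        annotations[key].append((source, identifier, annotation))
    return sequences, annotations


def translate_hits(hits, annotations, namespace):
    """Map hits against checksum keys into one annotation namespace.

    hits: iterable of (query_id, key, score) tuples from a single search.
    """
    return [
        (query, identifier, annotation, score)
        for query, key, score in hits
        for source, identifier, annotation in annotations.get(key, [])
        if source == namespace
    ]


# Hypothetical records: two sources describe the same protein sequence.
records = [
    ("KEGG", "kegg:0001", "hypothetical kinase", "MKTAYIAK"),
    ("GenBank", "gb:0002", "putative kinase", "mktayiak"),
    ("GenBank", "gb:0003", "transporter", "MSLLNV"),
]
sequences, annotations = build_nr(records)

# One similarity result, reported against the checksum key, serves both views.
hits = [("read1", sequence_key("MKTAYIAK"), 52.0)]
kegg_view = translate_hits(hits, annotations, "KEGG")
genbank_view = translate_hits(hits, annotations, "GenBank")
```

Here the search is run once against the non-redundant set, and `kegg_view` and `genbank_view` are derived from the same hit list by lookup alone.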