Open Access Software

Simrank: Rapid and sensitive general-purpose k-mer search tool

Todd Z DeSantis1*, Keith Keller2, Ulas Karaoz1, Alexander V Alekseyenko5, Navjeet NS Singh1, Eoin L Brodie1, Zhiheng Pei3, Gary L Andersen1 and Niels Larsen4

Author affiliations

1 Ecology Department, Lawrence Berkeley National Laboratory, Berkeley, USA

2 Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, USA

3 Department of Microbiology, New York University School of Medicine, New York, USA

4 Department of Molecular Biology, Aarhus University, Aarhus, Denmark

5 Center for Health Informatics and Bioinformatics, New York University Langone Medical Center, New York, USA

Citation and License

BMC Ecology 2011, 11:11  doi:10.1186/1472-6785-11-11

Published: 27 April 2011

Abstract

Background

Terabyte-scale collections of string-encoded data are expected from consortium efforts such as the Human Microbiome Project (http://nihroadmap.nih.gov/hmp). Rapid k-mer matching strategies enable similarity searches both within and between such projects. Software applications for sequence database partitioning, guide tree estimation, molecular classification and alignment acceleration have benefited from embedded k-mer searches as sub-routines. However, a rapid, general-purpose, open-source, flexible, stand-alone k-mer tool has not been available.

Results

Here we present a stand-alone utility, Simrank, which allows users to rapidly identify the database strings most similar to query strings. Performance testing of Simrank and related tools against DNA, RNA, protein and human-language datasets found Simrank to be 10X to 928X faster, depending on the dataset.
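
As an illustration of the general idea of ranking database strings by k-mer similarity to a query, the following minimal Python sketch may help. It is not Simrank's implementation: the function names (`kmers`, `rank_by_shared_kmers`), the default k of 7, and the scoring rule (shared unique k-mers divided by the smaller of the two unique k-mer counts) are assumptions chosen for this example.

```python
def kmers(s, k):
    """Return the set of unique k-mers (length-k substrings) in s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def rank_by_shared_kmers(query, database, k=7, top=3):
    """Rank database strings by k-mer similarity to the query.

    Hypothetical scoring (an assumption for this sketch): the number of
    unique k-mers shared by query and database string, divided by the
    smaller of their unique k-mer counts.
    """
    q = kmers(query, k)
    scores = []
    for name, seq in database.items():
        d = kmers(seq, k)
        smaller = min(len(q), len(d))
        # Strings shorter than k have no k-mers; score them as 0.
        score = len(q & d) / smaller if smaller else 0.0
        scores.append((name, score))
    # Highest score first; ties broken by name for a stable order.
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores[:top]


if __name__ == "__main__":
    db = {"a": "ACGTACGTAC", "b": "TTTTTTTTTT", "c": "ACGTTTTTTT"}
    for name, score in rank_by_shared_kmers("ACGTACGTAC", db, k=4):
        print(name, score)
```

Because the sketch compares sets of substrings rather than aligning characters, it applies equally to DNA, RNA, protein or natural-language strings; a production tool would typically precompute and index the database k-mers rather than rebuild them per query.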

Conclusions

Simrank provides molecular ecologists with a high-throughput, open-source option for finding similarities among large sequence sets.