Simrank: Rapid and sensitive general-purpose k-mer search tool
1 Ecology Department, Lawrence Berkeley National Laboratory, Berkeley, USA
2 Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, USA
3 Department of Microbiology, New York University School of Medicine, New York, USA
4 Department of Molecular Biology, Aarhus University, Aarhus, Denmark
5 Center for Health Informatics and Bioinformatics, New York University Langone Medical Center, New York, USA
BMC Ecology 2011, 11:11 doi:10.1186/1472-6785-11-11
Published: 27 April 2011
Terabyte-scale collections of string-encoded data are expected from consortium efforts such as the Human Microbiome Project (http://nihroadmap.nih.gov/hmp). Rapid k-mer matching strategies make similarity searches within and between such projects feasible. Software applications for sequence database partitioning, guide tree estimation, molecular classification and alignment acceleration have benefited from embedded k-mer searches as sub-routines. However, a rapid, general-purpose, open-source, flexible, stand-alone k-mer tool has not been available.
Here we present a stand-alone utility, Simrank, which allows users to rapidly identify the database strings most similar to query strings. In performance tests of Simrank and related tools on DNA, RNA, protein and human-language datasets, Simrank was 10X to 928X faster, depending on the dataset.
Simrank provides molecular ecologists with a high-throughput, open-source choice for comparing large sequence sets to find similar sequences.
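To make the idea of a k-mer similarity search concrete, the following Python sketch indexes a small set of database strings by their k-mers and ranks them against a query. It is an illustration only and not Simrank's implementation: the function names (`kmers`, `build_index`, `rank`) are hypothetical, and the similarity score (the fraction of the query's distinct k-mers that also occur in a database string) is an assumption chosen for simplicity.

```python
from collections import defaultdict


def kmers(s, k):
    """Return the set of distinct length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def build_index(database, k):
    """Map each k-mer to the set of database names whose string contains it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def rank(query, index, k, top=3):
    """Rank database entries by the fraction of query k-mers they share.

    Assumed scoring (not taken from the paper): shared distinct k-mers
    divided by the number of distinct k-mers in the query.
    """
    query_kmers = kmers(query, k)
    if not query_kmers:
        return []
    shared = defaultdict(int)
    for kmer in query_kmers:
        # .get avoids inserting empty entries into the defaultdict index
        for name in index.get(kmer, ()):
            shared[name] += 1
    scored = [(name, count / len(query_kmers)) for name, count in shared.items()]
    # Highest score first; ties broken alphabetically by name
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top]


if __name__ == "__main__":
    database = {"a": "ACGTACGT", "b": "TTTTGGGG", "c": "ACGTTTTT"}
    index = build_index(database, k=4)
    for name, score in rank("ACGTACGA", index, k=4):
        print(name, score)
```

Because the index is built once and each query only touches the database entries that share at least one k-mer with it, a lookup of this kind avoids comparing the query against every database string in full, which is the general reason k-mer searches are fast.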