This article is part of the supplement: Proceedings of the 11th Annual Bioinformatics Open Source Conference (BOSC) 2010

Open Access Proceedings

WordSeeker: concurrent bioinformatics software for discovering genome-wide patterns and word-based genomic signatures

Jens Lichtenberg1*, Kyle Kurz1, Xiaoyu Liang1, Rami Al-ouran1, Lev Neiman1, Lee J Nau1, Joshua D Welch1, Edwin Jacox2, Thomas Bitterman3, Klaus Ecker1, Laura Elnitski4, Frank Drews1, Stephen Sauchi Lee5 and Lonnie R Welch1,6,7

Author affiliations

1 Bioinformatics Laboratory, School of EECS, Ohio University, Athens, Ohio 45701, USA

2 Developmental Biology Institute of Marseille, Luminy F-13009, Marseille, France

3 Cyberinfrastructure Group, Ohio Supercomputer Center, Columbus, Ohio 43212, USA

4 Genomic Functional Analysis Section, National Human Genome Research Institute, NIH, Rockville, Maryland 20892, USA

5 Department of Statistics, University of Idaho, Moscow, Idaho 83844, USA

6 Biomedical Engineering Program, Ohio University, Athens, Ohio 45701, USA

7 Molecular and Cellular Biology Program, Ohio University, Athens, Ohio 45701, USA

Citation and License

BMC Bioinformatics 2010, 11(Suppl 12):S6  doi:10.1186/1471-2105-11-S12-S6

Published: 21 December 2010



An important focus of genomic science is the discovery and characterization of all functional elements within genomes. In silico methods are used in genome studies to discover putative regulatory genomic elements (called words or motifs). Although a number of methods have been developed for motif discovery, most of them lack the scalability needed to analyze large genomic data sets.


This manuscript presents WordSeeker, an enumerative motif discovery toolkit that utilizes multi-core and distributed computational platforms to enable scalable analysis of genomic data. A controller task coordinates activities of worker nodes, each of which (1) enumerates a subset of the DNA word space and (2) scores words with a distributed Markov chain model.
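The controller/worker division described above can be illustrated with a minimal sketch. It is not WordSeeker's implementation: the function names (`count_words`, `worker`, `controller`) are hypothetical, threads stand in for worker nodes, the word space is partitioned by first letter purely for illustration, and the scoring uses a maximal-order Markov estimate of the expected count (an assumption, since the abstract does not specify the model order). The sketch assumes a word length of at least 3.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

ALPHABET = "ACGT"


def count_words(seq, k):
    """Count every overlapping length-k word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def worker(seq, k, prefixes):
    """Enumerate and score only the words whose first letter is in `prefixes`.

    Assumed scoring: expected count under a maximal-order Markov model,
    E(w) = N(w[:-1]) * N(w[1:]) / N(w[1:-1]), reported with observed/expected.
    """
    words = {w: c for w, c in count_words(seq, k).items() if w[0] in prefixes}
    sub1 = count_words(seq, k - 1)  # counts of (k-1)-letter subwords
    sub2 = count_words(seq, k - 2)  # counts of (k-2)-letter subwords
    scores = {}
    for w, observed in words.items():
        expected = sub1[w[:-1]] * sub1[w[1:]] / sub2[w[1:-1]]
        scores[w] = (observed, expected, observed / expected)
    return scores


def controller(seq, k, n_workers=2):
    """Split the word space among workers and merge their partial results."""
    # Each worker receives a disjoint set of first letters.
    parts = [ALPHABET[i::n_workers] for i in range(n_workers)]
    merged = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(lambda p: worker(seq, k, p), parts):
            merged.update(partial)
    return merged


if __name__ == "__main__":
    # Toy input sequence, not data from the study.
    for word, (obs, exp, ratio) in sorted(controller("ACGTACGTACGA", 3).items()):
        print(word, obs, round(exp, 2), round(ratio, 2))
```

Because the partitions are disjoint, the merged result does not depend on the number of workers; only the distribution of work changes.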


A comprehensive suite of tests was conducted to demonstrate the performance, speedup and efficiency of WordSeeker. The scalability of the toolkit enabled the analysis of the entire genome of Arabidopsis thaliana; the results of the analysis were integrated into The Arabidopsis Gene Regulatory Information Server (AGRIS). A public version of WordSeeker was deployed on the Glenn cluster at the Ohio Supercomputer Center.
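For readers unfamiliar with the metrics named above, the following sketch uses the conventional definitions, which are assumed here rather than taken from the paper: speedup is the serial run time divided by the parallel run time, and efficiency is speedup divided by the number of workers. The function names and the timings in the usage lines are hypothetical and are not measurements from the study.

```python
def speedup(t_serial, t_parallel):
    """Conventional speedup: serial run time over parallel run time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_workers):
    """Conventional efficiency: speedup per worker (1.0 is ideal)."""
    return speedup(t_serial, t_parallel) / n_workers


if __name__ == "__main__":
    # Made-up timings in seconds, for illustration only.
    print(speedup(120.0, 20.0))
    print(efficiency(120.0, 20.0, 8))
```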


WordSeeker effectively utilizes concurrent computing platforms to enable the identification of putative functional elements in genomic data sets. This capability facilitates the analysis of the large quantity of sequenced genomic data.