This article is part of the supplement: Proceedings of the 2012 International Conference on Intelligent Computing (ICIC 2012)

Open Access Proceedings

Acceleration of sequence clustering using longest common subsequence filtering

Youhei Namiki, Takashi Ishida and Yutaka Akiyama*

Author Affiliations

Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, Ookayama, Meguro, Tokyo 152-8552, Japan

BMC Bioinformatics 2013, 14(Suppl 8):S7  doi:10.1186/1471-2105-14-S8-S7

Published: 9 May 2013

Abstract

Background

Recent improvements in sequencing throughput allow huge numbers of genomes to be sequenced rapidly. However, data analysis methods have not kept up, making it difficult to process the vast amounts of available sequence data. The resulting increase in processing time is especially critical in DNA sequence clustering, which is intrinsically difficult to parallelize. Thus, there is a strong demand for a faster clustering algorithm.

Results

We developed a new fast DNA sequence clustering method called LCS-HIT, based on the popular CD-HIT program. The proposed method uses a novel filtering technique based on the longest common subsequence to identify similar sequence pairs. This filtering technique makes LCS-HIT considerably faster than CD-HIT, without loss of sensitivity. For a dataset of two million DNA sequences, our method was approximately 7.1, 4.4, and 2.2 times faster than CD-HIT for sequences of 100, 150, and 400 bases, respectively.
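
The idea behind a longest common subsequence (LCS) filter can be illustrated with a minimal sketch. It is not the LCS-HIT implementation, whose details are not given in this abstract. It assumes that identity is measured as the number of matched bases divided by the length of the shorter sequence, and it uses a plain dynamic-programming LCS. Because the matched columns of any alignment form a common subsequence, the LCS length is an upper bound on the number of matches, so a pair whose LCS falls below the required count can be rejected without computing an alignment. The function names and the `identity` parameter are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    # Keep the shorter sequence on the inner loop to bound memory use.
    if len(a) < len(b):
        a, b = b, a
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def passes_lcs_filter(a: str, b: str, identity: float = 0.9) -> bool:
    """Return False only when the pair cannot reach the identity threshold.

    Assumption: identity = matches / length of the shorter sequence.
    Matches in any alignment never exceed the LCS length, so a pair that
    fails this test can be skipped safely. A pair that passes still needs
    a full alignment to confirm its identity.
    """
    shorter = min(len(a), len(b))
    return lcs_length(a, b) >= identity * shorter
```

A filter of this kind never discards a pair that would meet the threshold, so it cannot reduce sensitivity under the stated identity definition. It only reduces the number of pairs passed to the more expensive alignment step.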

Conclusions

The LCS-HIT clustering program, using a novel filtering technique based on the longest common subsequence, is significantly faster than CD-HIT without compromising clustering accuracy. Moreover, the filtering technique itself is independent of the CD-HIT algorithm and can therefore be applied to similar clustering algorithms.