Open Access Research article

Large scale clustering of protein sequences with FORCE - A layout based heuristic for weighted cluster editing

Tobias Wittkop (1,2,3), Jan Baumbach (1,2,4)*, Francisco P Lobo (1,5) and Sven Rahmann (6)

Author Affiliations

1 Computational Methods for Emerging Technologies, Bielefeld University, Bielefeld, Germany

2 Genome informatics, Bielefeld University, Bielefeld, Germany

3 DFG Graduiertenkolleg Bioinformatik, Bielefeld University, Bielefeld, Germany

4 International Graduate School in Bioinformatics and Genome Research, Center for Biotechnology, Bielefeld, Germany

5 Laboratorio de Genetica Bioquimica, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

6 Bioinformatics for High-Throughput Technologies, Technical University of Dortmund, Germany

BMC Bioinformatics 2007, 8:396  doi:10.1186/1471-2105-8-396

Published: 17 October 2007

Abstract

Background

Detecting groups of functionally related proteins from their amino acid sequences alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing technologies provide huge amounts of sequence data, which must be clustered faster than before and with the same or better accuracy.

Results

We argue that the model of weighted cluster editing, also known as transitive graph projection, is well suited to protein clustering. We present the FORCE heuristic, which is based on transitive graph projection and clusters arbitrary sets of objects, given pairwise similarity measures. In particular, we apply FORCE to the problem of protein clustering and show that it outperforms the most popular existing clustering tools (Spectral clustering, TribeMCL, GeneRAGE, Hierarchical clustering, and Affinity Propagation). Furthermore, we show that FORCE is able to handle huge datasets by calculating clusters for all 192 187 prokaryotic protein sequences (66 organisms) obtained from the COG database. Finally, FORCE is integrated into the corynebacterial reference database CoryneRegNet.
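
To make the weighted cluster editing model more concrete, the following is a minimal Python sketch of the objective only, not of the FORCE heuristic itself. It assumes the common formulation in which two objects are connected when their similarity exceeds a threshold, and in which adding or deleting an edge costs the absolute difference between the similarity and the threshold. The function names, the default similarity of 0 for unlisted pairs, and the exhaustive search over partitions are illustrative assumptions; the exhaustive search is feasible only for a handful of objects, which is why a heuristic is needed for large datasets.

```python
from itertools import combinations


def editing_cost(similarity, clusters, threshold):
    """Cost of editing the thresholded similarity graph into the given clusters.

    Assumed cost model: a pair is an edge if its similarity exceeds the
    threshold; deleting an edge or inserting a missing one costs
    |similarity - threshold|. Pairs not listed get similarity 0 (assumption).
    """
    cluster_of = {node: i for i, members in enumerate(clusters) for node in members}
    cost = 0.0
    for u, v in combinations(sorted(cluster_of), 2):
        s = similarity.get((u, v), similarity.get((v, u), 0.0))
        has_edge = s > threshold
        same_cluster = cluster_of[u] == cluster_of[v]
        if has_edge != same_cluster:  # edge must be deleted or inserted
            cost += abs(s - threshold)
    return cost


def partitions(items):
    """Yield every partition of a list into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # put `first` into each existing group in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new group of its own
        yield [[first]] + part


def best_clustering(similarity, nodes, threshold):
    """Exhaustive search for a minimum-cost clustering (tiny inputs only)."""
    return min(partitions(list(nodes)),
               key=lambda p: editing_cost(similarity, p, threshold))
```

For instance, with hypothetical similarities a-b = 9, b-c = 8, a-c = 4, c-d = 6 and a threshold of 5, the thresholded graph is the path a-b-c-d, which is not transitive. Under the assumed cost model the cheapest repair inserts a-c and deletes c-d, giving the clusters {a, b, c} and {d}.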

Conclusion

FORCE is an applicable alternative to existing clustering algorithms. Its theoretical foundation, weighted cluster editing, can outperform other clustering paradigms on protein homology clustering. FORCE is open source and implemented in Java. The software, including the source code, the clustering results for COG and CoryneRegNet, and all evaluation datasets are available at http://gi.cebitec.uni-bielefeld.de/comet/force/.