Open Access | Highly Accessed | Methodology article

Large scale hierarchical clustering of protein sequences

Antje Krause1,3, Jens Stoye2 and Martin Vingron1

1 Max Planck Institute for Molecular Genetics, Computational Molecular Biology, Ihnestrasse 73, 14195 Berlin, Germany

2 Universität Bielefeld, Technische Fakultät, AG Genominformatik, Postfach 100131, 33501 Bielefeld, Germany

3 TFH Wildau, Bahnhofstrasse 1, 15745 Wildau, Germany


BMC Bioinformatics 2005, 6:15 doi:10.1186/1471-2105-6-15

Published: 22 January 2005

Abstract

Background

Searching a biological sequence database with a query sequence to find homologues has become a routine operation in computational biology. Despite the high degree of sophistication of currently available search routines, it is still virtually impossible to identify quickly and clearly the group of sequences to which a given query sequence belongs.

Results

We report on our developments in grouping all known protein sequences hierarchically into superfamily and family clusters. Our graph-based algorithms take into account the topology of the sequence space induced by the data itself to construct a biologically meaningful partitioning. We have applied our clustering procedures to a non-redundant set of about 1,000,000 sequences, resulting in a hierarchical clustering which is being made available for querying and browsing at http://systers.molgen.mpg.de/.

Conclusions

Comparisons with other widely used clustering methods on various data sets show the abilities and strengths of our clustering methods in producing a biologically meaningful grouping of protein sequences.

