Research article (Open Access)

Partially-supervised protein subclass discovery with simultaneous annotation of functional residues

Benjamin Georgi1,4*, Jörg Schultz2 and Alexander Schliep1,3

Author Affiliations

1 Max Planck Institute for Molecular Genetics, Dept. of Computational Molecular Biology, Ihnestrasse 73, 14195 Berlin, Germany

2 Biozentrum, Dept. of Bioinformatics, Universität Würzburg, 97074 Würzburg, Germany

3 Current address: Dept. of Computer Science and BioMaPS Institute for Quantitative Biology, Rutgers, The State University of New Jersey, Piscataway, NJ, 08854, USA

4 Current address: Department of Genetics, University of Pennsylvania, 528 CRB, 415 Curie Blvd, Philadelphia, PA 19104, USA

BMC Structural Biology 2009, 9:68  doi:10.1186/1472-6807-9-68

Published: 26 October 2009



Studying the functional subfamilies of protein domain families and identifying the residues that determine substrate specificity are important questions in the analysis of protein domains. One way to address them is to cluster protein sequence data and then predict functional residues from the resulting clusters. The locations of putative functional residues in known protein structures provide insight into how different substrate specificities are reflected at the protein structure level.


We have developed an extension of the context-specific independence mixture model clustering framework that allows experimental data to be integrated. Because such data are usually available for only a few proteins, our algorithm implements a partially-supervised learning approach. To demonstrate the usefulness of our approach, we discover domain subfamilies and predict functional residues for four protein domain families: phosphatases, pyridoxal-dependent decarboxylases, WW domains and SH3 domains.
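The idea of partially-supervised mixture clustering can be illustrated with a minimal sketch. It is not the authors' context-specific independence model: it assumes a plain mixture with one independent categorical distribution per alignment column, and every name in it (`partially_supervised_em`, `discriminative_columns`, the toy sequences and labels) is hypothetical. Sequences with a known subfamily keep that assignment fixed during EM, while the remaining sequences are assigned by the model. Columns whose most probable residue differs between components serve as a crude stand-in for predicted functional residues.

```python
import math


def partially_supervised_em(seqs, labels, k, alphabet, n_iter=30, pseudo=0.1):
    """EM for a mixture of per-column categorical distributions.

    seqs: equal-length aligned sequences; labels: {sequence index: component}
    for the few sequences with known subfamily. Hypothetical sketch only.
    """
    n, length = len(seqs), len(seqs[0])
    # Labeled sequences get fixed one-hot responsibilities; unlabeled ones
    # start uniform, so the labels alone break the symmetry.
    resp = [[float(c == labels[i]) for c in range(k)] if i in labels
            else [1.0 / k] * k for i in range(n)]
    params = []
    for _ in range(n_iter):
        # M-step: smoothed mixture weights and per-column residue frequencies.
        weights = [(sum(r[c] for r in resp) + pseudo) / (n + k * pseudo)
                   for c in range(k)]
        params = []
        for c in range(k):
            columns = []
            for j in range(length):
                counts = {a: pseudo for a in alphabet}
                for i in range(n):
                    counts[seqs[i][j]] += resp[i][c]
                total = sum(counts.values())
                columns.append({a: counts[a] / total for a in alphabet})
            params.append(columns)
        # E-step: update responsibilities of unlabeled sequences only.
        for i in range(n):
            if i in labels:
                continue
            logp = [math.log(weights[c])
                    + sum(math.log(params[c][j][seqs[i][j]])
                          for j in range(length))
                    for c in range(k)]
            top = max(logp)
            unnorm = [math.exp(v - top) for v in logp]
            z = sum(unnorm)
            resp[i] = [u / z for u in unnorm]
    assignments = [row.index(max(row)) for row in resp]
    return assignments, params


def discriminative_columns(params):
    """Columns whose most probable residue differs between components."""
    cols = []
    for j in range(len(params[0])):
        top_residues = {max(comp[j], key=comp[j].get) for comp in params}
        if len(top_residues) > 1:
            cols.append(j)
    return cols


# Toy alignment: only sequences 0 and 3 have a known subfamily.
seqs = ["ACDA", "ACDA", "ACDG", "GCTA", "GCTA", "GCTG"]
labels = {0: 0, 3: 1}
assignments, params = partially_supervised_em(seqs, labels, k=2, alphabet="ACDGT")
columns = discriminative_columns(params)
print(assignments)  # [0, 0, 0, 1, 1, 1]
print(columns)      # [0, 2]
```

In this toy alignment, columns 0 and 2 separate the two groups, while columns 1 and 3 are shared by both and are therefore not reported.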


The partially-supervised clustering revealed biologically meaningful subfamilies even for highly heterogeneous domains, and the predicted functional residues provide insights into the basis of the different substrate specificities.