This article is part of the supplement: The Second Automated Function Prediction Meeting

Open Access Proceedings

Clustering protein environments for function prediction: finding PROSITE motifs in 3D

Sungroh Yoon1,5, Jessica C Ebert2, Eui-Young Chung3, Giovanni De Micheli4 and Russ B Altman2*

Author Affiliations

1 Computer Systems Laboratory, Stanford University, Stanford, CA 94305, USA

2 Department of Genetics, Stanford University, Stanford, CA 94305, USA

3 School of Electrical and Electronic Engineering, Yonsei University, Seoul 120-749, Republic of Korea

4 Integrated Systems Center, Swiss Federal Institute of Technology (EPFL), Lausanne, CH-1015, Switzerland

5 Intel Corporation, 2200 Mission College Blvd., Santa Clara, CA 95054, USA

BMC Bioinformatics 2007, 8(Suppl 4):S10  doi:10.1186/1471-2105-8-S4-S10

Published: 22 May 2007

Abstract

Background

Structural genomics initiatives are producing increasing numbers of three-dimensional (3D) structures for which there is little functional information. Structure-based annotation of molecular function is therefore becoming critical. We previously presented FEATURE, a method for describing microenvironments around functional sites in proteins. However, FEATURE uses supervised machine learning and so is limited to building models for sites of known importance and location. We hypothesized that proteins contain a large number of function-associated sites that have not yet been recognized. To test this, we developed a method for clustering protein microenvironments in order to evaluate the potential for discovering novel sites.

Results

We have prototyped a computational method for rapid clustering of millions of microenvironments in order to discover residues whose surrounding environments are similar and which may therefore share a functional or structural role. We clustered nearly 2,000,000 environments from 9,600 protein chains and defined 4,550 clusters. As a preliminary validation, we asked whether known 3D environments associated with PROSITE motifs were "rediscovered". We found examples of clusters highly enriched for residues that share PROSITE sequence motifs.
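One way to make "highly enriched" concrete is a hypergeometric tail probability: given how many residues in the whole data set carry a motif, how surprising is it to see that many inside a single cluster? The sketch below is only an illustration. The abstract does not say which statistic was used, and the function name `enrichment_p_value` and all counts are hypothetical.

```python
from math import comb

def enrichment_p_value(population, motif_total, cluster_size, motif_in_cluster):
    """Probability of seeing at least `motif_in_cluster` motif residues in a
    cluster of `cluster_size` drawn at random from `population` residues,
    of which `motif_total` carry the motif (hypergeometric upper tail).
    Assumption: this test is illustrative, not the one used in the paper."""
    upper = min(cluster_size, motif_total)
    total = comb(population, cluster_size)
    tail = sum(
        comb(motif_total, i) * comb(population - motif_total, cluster_size - i)
        for i in range(motif_in_cluster, upper + 1)
    )
    return tail / total

# Hypothetical toy counts: 20 residues overall, 5 carry the motif,
# and a cluster of 4 residues contains 4 of them.
p_enriched = enrichment_p_value(20, 5, 4, 4)

# Observing "at least zero" motif residues is certain.
p_trivial = enrichment_p_value(20, 5, 4, 0)
```

A small p-value suggests that the cluster gathers motif-bearing residues more often than chance would. With thousands of clusters tested, a multiple-testing correction would also be needed.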

Conclusion

Our results demonstrate that we can cluster protein environments successfully using a simplified representation and the K-means clustering algorithm. The rediscovery of known 3D motifs allows us to calibrate the size and intercluster distances that characterize useful clusters. This information will then allow us to find new clusters with similar characteristics that represent novel structural or functional sites.
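The K-means step mentioned above can be sketched in a few lines. This is a minimal illustration of the standard Lloyd iteration on toy two-dimensional vectors. It is not the authors' implementation, and it does not reproduce the simplified representation of microenvironments used in the paper. The function names, the evenly spaced choice of initial centroids, and the data are assumptions made for a reproducible example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=100):
    """Plain Lloyd-style K-means.
    Assumption: initial centroids are taken at evenly spaced positions in
    the input so that the example is deterministic."""
    n = len(vectors)
    centroids = [list(vectors[i * n // k]) for i in range(k)]
    labels = [None] * n
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                dim = len(members[0])
                centroids[c] = [
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                ]
    return labels, centroids

# Hypothetical toy "environment" vectors forming two well-separated groups.
toy = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
       [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centroids = kmeans(toy, k=2)
```

For millions of vectors, a vectorized or mini-batch implementation would be more practical than this pure-Python loop. The sketch is meant only to show the assignment and update steps.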