Email updates

Keep up to date with the latest news and content from BMC Structural Biology and BioMed Central.

Open Access Highly Accessed Research article

Evolutionarily consistent families in SCOP: sequence, structure and function

Ralph B Pethica1*, Michael Levitt2 and Julian Gough1

Author affiliations

1 Department of Computer Science, University of Bristol, The Merchant Venturers Building, Room 3.16, Woodland Road, Bristol, UK

2 Department of Structural Biology, Stanford University School of Medicine, Stanford, 94305, CA, USA

For all author emails, please log on.

Citation and License

BMC Structural Biology 2012, 12:27  doi:10.1186/1472-6807-12-27

Published: 18 October 2012

Abstract

Background

SCOP is a hierarchical domain classification system for proteins of known structure. The superfamily level has a clear definition: Protein domains belong to the same superfamily if there is structural, functional and sequence evidence for a common evolutionary ancestor. Superfamilies are sub-classified into families, however, there is not such a clear basis for the family level groupings. Do SCOP families group together domains with sequence similarity, do they group domains with similar structure or by common function? It is these questions we answer, but most importantly, whether each family represents a distinct phylogenetic group within a superfamily.

Results

Several phylogenetic trees were generated for each superfamily: one derived from a multiple sequence alignment, one based on structural distances, and the final two from presence/absence of GO terms or EC numbers assigned to domains. The topologies of the resulting trees and confidence values were compared to the SCOP family classification.

Conclusions

We show that SCOP family groupings are evolutionarily consistent to a very high degree with respect to classical sequence phylogenetics. The trees built from (automatically generated) structural distances correlate well, but are not always consistent with SCOP (hand annotated) groupings. Trees derived from functional data are less consistent with the family level than those from structure or sequence, though the majority still agree. Much of GO and EC annotation applies directly to one family or subset of the family; relatively few terms apply at the superfamily level. Maximum sequence diversity within a family is on average 22% but close to zero for superfamilies.