Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

Open Access Open Badges Research article

Automatic structure classification of small proteins using random forest

Pooja Jain and Jonathan D Hirst*

Author Affiliations

School of Chemistry, The University of Nottingham, University Park, Nottingham, NG7 2RD, UK

For all author emails, please log on.

BMC Bioinformatics 2010, 11:364  doi:10.1186/1471-2105-11-364

Published: 1 July 2010



Random forest, an ensemble based supervised machine learning algorithm, is used to predict the SCOP structural classification for a target structure, based on the similarity of its structural descriptors to those of a template structure with an equal number of secondary structure elements (SSEs). An initial assessment of random forest is carried out for domains consisting of three SSEs. The usability of random forest in classifying larger domains is demonstrated by applying it to domains consisting of four, five and six SSEs.


Random forest, trained on SCOP version 1.69, achieves a predictive accuracy of up to 94% on an independent and non-overlapping test set derived from SCOP version 1.73. For classification to the SCOP Class, Fold, Super-family or Family levels, the predictive quality of the model in terms of Matthew's correlation coefficient (MCC) ranged from 0.61 to 0.83. As the number of constituent SSEs increases the MCC for classification to different structural levels decreases.


The utility of random forest in classifying domains from the place-holder classes of SCOP to the true Class, Fold, Super-family or Family levels is demonstrated. Issues such as introduction of a new structural level in SCOP and the merger of singleton levels can also be addressed using random forest. A real-world scenario is mimicked by predicting the classification for those protein structures from the PDB, which are yet to be assigned to the SCOP classification hierarchy.