
This article is part of the supplement: Proceedings of the 10th Bio-Ontologies Special Interest Group Workshop 2007. Ten years past and looking to the future

Open Access Proceedings

Metrics for GO based protein semantic similarity: a systematic evaluation

Catia Pesquita1, Daniel Faria1, Hugo Bastos1, António EN Ferreira2, André O Falcão1 and Francisco M Couto1

1XLDB, Departamento de Informática, Faculdade de Ciências da Universidade de Lisboa, Campo Grande - Edifício C6, Lisboa, Portugal

2Centro de Química e Bioquímica, Departamento de Química e Bioquímica, Faculdade de Ciências da Universidade de Lisboa, Campo Grande - Edifício C8, Lisboa, Portugal


BMC Bioinformatics 2008, 9(Suppl 5):S4 doi:10.1186/1471-2105-9-S5-S4

Published: 29 April 2008

Abstract

Background

Several semantic similarity measures have been applied to gene products annotated with Gene Ontology terms, providing a basis for their functional comparison. However, it is still unclear which approach to semantic similarity is best in this context, since there has been no conclusive evaluation of the various measures. Another open issue is whether electronic annotations should be used in semantic similarity calculations.

Results

We conducted a systematic evaluation of GO-based semantic similarity measures using the relationship with sequence similarity as a means to quantify their performance, and assessed the influence of electronic annotations by testing the measures in the presence and absence of these annotations. We verified that the relationship between semantic and sequence similarity is not linear, but can be well approximated by a rescaled Normal cumulative distribution function. Given that the majority of the semantic similarity measures capture an identical behaviour, but differ in resolution, we used the latter as the main criterion of evaluation.
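As an illustration of what a "rescaled Normal cumulative distribution function" can look like, the sketch below evaluates a Normal CDF that has been stretched and shifted so that it need not run from 0 to 1. The abstract does not give the parametrisation, so the function name and the parameters `mu`, `sigma`, `scale` and `offset` are assumptions made for this example, not the authors' definitions.

```python
import math


def rescaled_normal_cdf(x, mu, sigma, scale=1.0, offset=0.0):
    """Normal CDF with mean mu and standard deviation sigma,
    multiplied by `scale` and shifted by `offset`.

    Hypothetical parametrisation: x would be a sequence similarity
    score and the return value the modelled semantic similarity.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # Standard identity: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    z = (x - mu) / (sigma * math.sqrt(2.0))
    return offset + scale * 0.5 * (1.0 + math.erf(z))
```

With this form the curve rises from `offset` to `offset + scale`, with its steepest point at `x = mu`. That S-shape is one way a non-linear relationship between two similarity scores can be modelled.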

Conclusions

This work provides a basis for comparing several semantic similarity measures, and can help researchers choose the most suitable measure for their work. We found that the hybrid simGIC measure had the best overall performance, followed by Resnik's measure using a best-match average combination approach. We also found that the average and maximum combination approaches are problematic, since both are inherently influenced by the number of terms being combined. We suspect that data circularity may directly influence the results that include electronic annotations, as a result of functional inference from sequence similarity.
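The measures named above can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation. It assumes that each protein's annotation set has already been extended with all ancestor terms, that `ic` maps each term to a precomputed information content, and that the pairwise term similarities (for example Resnik's) are supplied as a matrix. The best-match average shown is one common variant (the mean of the two directional averages). All names are hypothetical.

```python
def sim_gic(terms_a, terms_b, ic):
    """simGIC: summed information content (IC) of the shared terms
    divided by the summed IC of all terms in either set.
    Both sets are assumed to already include ancestor terms."""
    a, b = set(terms_a), set(terms_b)
    denom = sum(ic[t] for t in a | b)
    if denom == 0:
        return 0.0
    return sum(ic[t] for t in a & b) / denom


def combine(matrix, how):
    """Combine a term-by-term similarity matrix into one score.
    Rows are the terms of one protein, columns those of the other."""
    rows = [list(r) for r in matrix]
    cols = [list(c) for c in zip(*rows)]
    if how == "max":
        # Single best pair of terms
        return max(max(r) for r in rows)
    if how == "avg":
        # Mean over all pairs of terms
        return sum(sum(r) for r in rows) / (len(rows) * len(cols))
    if how == "bma":
        # Best-match average: best partner for each term, both directions
        row_best = sum(max(r) for r in rows) / len(rows)
        col_best = sum(max(c) for c in cols) / len(cols)
        return (row_best + col_best) / 2.0
    raise ValueError("unknown combination approach: " + how)


# Toy self-comparison of a protein with two terms (values invented):
# each term is identical to itself (1.0) and weakly similar to the other (0.2).
toy = [[1.0, 0.2],
       [0.2, 1.0]]
scores = {how: combine(toy, how) for how in ("max", "avg", "bma")}
```

In this toy case the average approach scores a protein compared with itself at 0.6 rather than 1, and that score drops further as more dissimilar terms are added. The maximum approach returns 1 as soon as any single pair of terms matches, however many other terms differ. This is the dependence on the number of terms that the conclusions describe.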


© 1999-2008 BioMed Central Ltd unless otherwise stated