
This article is part of the supplement: Proceedings of the 2009 AMIA Summit on Translational Bioinformatics.

Open Access Proceedings

Evaluation of a large-scale biomedical data annotation initiative

Ronilda Lacson [1], Erik Pitzer [2], Christian Hinske [1], Pedro Galante [3] and Lucila Ohno-Machado [1]

[1] Decision Systems Group, Brigham & Women's Hospital, Harvard Medical School, Boston, MA, USA

[2] Upper Austria University of Applied Sciences, Hagenberg, Austria

[3] Ludwig Institute for Cancer Research, São Paulo Branch, São Paulo, Brazil


BMC Bioinformatics 2009, 10(Suppl 9):S10. doi:10.1186/1471-2105-10-S9-S10

Published: 17 September 2009

Abstract

Background

This study describes a large-scale manual re-annotation of data samples in the Gene Expression Omnibus (GEO), using variables and values derived from the National Cancer Institute thesaurus. A framework is described for creating an annotation scheme for various diseases that is flexible, comprehensive, and scalable. The annotation structure is evaluated by measuring coverage and agreement between annotators.

Results

There were 12,500 samples annotated with approximately 30 variables, in each of six disease categories – breast cancer, colon cancer, inflammatory bowel disease (IBD), rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), and Type 1 diabetes mellitus (DM). The annotators provided excellent variable coverage, with known values for over 98% of three critical variables: disease state, tissue, and sample type. There was 89% strict inter-annotator agreement and 92% agreement when using semantic and partial similarity measures.
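As an illustration of the two measures reported above, the following is a minimal sketch of how variable coverage and strict inter-annotator agreement could be computed. It is not the authors' implementation: the function names, the use of `None` for an unknown value, and the toy annotations are all assumptions made for this example. The semantic and partial similarity measures are not sketched, because the abstract does not define them.

```python
from typing import Optional, Sequence


def coverage(values: Sequence[Optional[str]]) -> float:
    """Fraction of samples with a known value for one variable.

    Assumption: an unknown value is represented as None.
    """
    if not values:
        raise ValueError("at least one sample is required")
    known = sum(1 for v in values if v is not None)
    return known / len(values)


def strict_agreement(a: Sequence[Optional[str]],
                     b: Sequence[Optional[str]]) -> float:
    """Fraction of samples on which two annotators gave identical values."""
    if len(a) != len(b):
        raise ValueError("both annotators must label the same samples")
    if not a:
        raise ValueError("at least one sample is required")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# Hypothetical annotations of the "disease state" variable for four samples.
annotator_1 = ["breast cancer", "colon cancer", "SLE", None]
annotator_2 = ["breast cancer", "colon cancer", "RA", "IBD"]

print(coverage(annotator_1))                       # 0.75
print(strict_agreement(annotator_1, annotator_2))  # 0.5
```

In this sketch an exact string match is the only form of agreement, so it corresponds to the strict figure rather than the similarity-based one.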

Conclusion

We show that it is possible to perform manual re-annotation of a large repository in a reliable manner.

