This article is part of the supplement: Proceedings of the 2009 AMIA Summit on Translational Bioinformatics

Open Access Proceedings

Evaluation of a large-scale biomedical data annotation initiative

Ronilda Lacson1*, Erik Pitzer2, Christian Hinske1, Pedro Galante3 and Lucila Ohno-Machado1

Author Affiliations

1 Decision Systems Group, Brigham & Women's Hospital, Harvard Medical School, Boston, MA, USA

2 Upper Austria University of Applied Sciences, Hagenberg, Austria

3 Ludwig Institute for Cancer Research, Sao Paulo Branch, Sao Paulo, Brazil

BMC Bioinformatics 2009, 10(Suppl 9):S10  doi:10.1186/1471-2105-10-S9-S10

Published: 17 September 2009



This study describes a large-scale manual re-annotation of data samples in the Gene Expression Omnibus (GEO), using variables and values derived from the National Cancer Institute thesaurus. A framework is described for creating an annotation scheme for various diseases that is flexible, comprehensive, and scalable. The annotation structure is evaluated by measuring coverage and agreement between annotators.


A total of 12,500 samples were annotated, using approximately 30 variables in each of six disease categories: breast cancer, colon cancer, inflammatory bowel disease (IBD), rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), and Type 1 diabetes mellitus (DM). The annotators provided excellent variable coverage, with known values in over 98% of cases for three critical variables: disease state, tissue, and sample type. Strict inter-annotator agreement was 89%, rising to 92% when semantic and partial similarity measures were used.
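
The two evaluation measures reported here, variable coverage and strict inter-annotator agreement, can be illustrated with a minimal sketch. This is not the authors' implementation: the record layout, the function names, and the sample values below are assumptions made for illustration, and the semantic and partial similarity measures are not reproduced.

```python
from typing import Dict, List, Optional

# Hypothetical record layout: one dict per sample, mapping a variable
# name to its annotated value. None, "" or "unknown" means no known value.
Annotation = Dict[str, Optional[str]]

UNKNOWN_VALUES = {None, "", "unknown"}


def coverage(annotations: List[Annotation], variable: str) -> float:
    """Fraction of samples that have a known value for `variable`."""
    if not annotations:
        return 0.0
    known = sum(1 for a in annotations if a.get(variable) not in UNKNOWN_VALUES)
    return known / len(annotations)


def strict_agreement(
    first: List[Annotation], second: List[Annotation], variables: List[str]
) -> float:
    """Fraction of (sample, variable) pairs with exactly equal values.

    Assumes both lists describe the same samples in the same order.
    Two missing values count as a match in this sketch.
    """
    total = 0
    matches = 0
    for a, b in zip(first, second):
        for variable in variables:
            total += 1
            if a.get(variable) == b.get(variable):
                matches += 1
    return matches / total if total else 0.0


# Made-up example data; the values are not taken from the study.
VARIABLES = ["disease state", "tissue", "sample type"]
annotator_a = [
    {"disease state": "breast cancer", "tissue": "breast", "sample type": "biopsy"},
    {"disease state": "colon cancer", "tissue": "colon", "sample type": None},
]
annotator_b = [
    {"disease state": "breast cancer", "tissue": "breast", "sample type": "cell line"},
    {"disease state": "colon cancer", "tissue": "colon", "sample type": None},
]

print(coverage(annotator_a, "sample type"))
print(strict_agreement(annotator_a, annotator_b, VARIABLES))
```

Strict agreement in this sketch is a simple exact-match rate over all compared values. A chance-corrected statistic, or a similarity-based measure like the one reported in the study, would require additional assumptions about the value vocabulary.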


We show that it is possible to perform manual re-annotation of a large repository in a reliable manner.