This article is part of the supplement: A Semantic Web for Bioinformatics: Goals, Tools, Systems, Applications .

Open Access Research

Terminologies for text-mining; an experiment in the lipoprotein metabolism domain

Dimitra Alexopoulou1, Thomas Wächter1, Laura Pickersgill2, Cecilia Eyre3 and Michael Schroeder1

Biotechnology Center (BIOTEC), Technische Universität Dresden, Dresden, D-01062, Germany

Unilever Corporate Research, Colworth, MK44 1LQ, UK

Unilever – Safety and Environmental Assurance Centre, Colworth, MK44 1LQ, UK


BMC Bioinformatics 2008, 9(Suppl 4):S2 doi:10.1186/1471-2105-9-S4-S2

Published: 25 April 2008

Abstract

Background

The engineering of ontologies, especially with a view to text-mining use, is still a new research field. No well-defined theory or technology for ontology construction exists yet. Many of the ontology design steps remain manual and are based on personal experience and intuition. However, there are a few efforts toward the automatic construction of ontologies, in the form of extracted lists of terms and the relations between them.

Results

We share experience acquired during the manual development of a lipoprotein metabolism ontology (LMO) to be used for text-mining. We compare the manually created ontology terms with the terminology derived automatically by four different automatic term recognition (ATR) methods. The top 50 predicted terms contain up to 89% relevant terms. For the top 1000 terms, the best method still generates 51% relevant terms. A corpus of 3066 documents contains 53% of the LMO terms, and 38% can be generated with one of the methods.
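The two kinds of figures reported here, the share of relevant terms among the top-ranked predictions and the share of ontology terms that are recovered, can be illustrated with a minimal sketch. The function names, the case-insensitive exact matching, and the toy term lists below are assumptions made for illustration only; they are not the evaluation procedure or the data used in the article.

```python
def precision_at_k(ranked_terms, relevant_terms, k):
    """Fraction of the top-k ranked candidate terms that are relevant.

    Matching is case-insensitive exact string matching (an assumption).
    """
    relevant = {t.lower() for t in relevant_terms}
    top_k = ranked_terms[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for term in top_k if term.lower() in relevant)
    return hits / len(top_k)


def coverage(ontology_terms, found_terms):
    """Fraction of ontology terms that occur among the found terms."""
    ontology = {t.lower() for t in ontology_terms}
    found = {t.lower() for t in found_terms}
    if not ontology:
        return 0.0
    return len(ontology & found) / len(ontology)


# Hypothetical toy data: a ranked ATR output and a small manual term list.
ranked = ["lipoprotein", "cholesterol", "study", "LDL receptor", "patient"]
manual_terms = ["lipoprotein", "cholesterol", "LDL receptor", "apolipoprotein B"]

print(precision_at_k(ranked, manual_terms, 2))  # 2 of the top 2 are relevant
print(precision_at_k(ranked, manual_terms, 5))  # 3 of the top 5 are relevant
print(coverage(manual_terms, ranked))           # 3 of 4 manual terms recovered
```

In this sketch, precision corresponds to statements such as "the top 50 predicted terms contain up to 89% relevant terms", while coverage corresponds to the share of LMO terms that a method can generate.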

Conclusions

Given high precision, automatic methods can help decrease development time and provide significant support for the identification of domain-specific vocabulary. The coverage of the domain vocabulary depends strongly on the underlying documents. Ontology development for text mining should be performed in a semi-automatic way, taking ATR results as input and following the guidelines we describe.

Availability

The TFIDF term recognition is available as a Web Service, described at http://gopubmed4.biotec.tu-dresden.de/IdavollWebService/services/CandidateTermGeneratorService?wsdl

