This article is part of the supplement: Proceedings of the Second International Symposium on Languages in Biology and Medicine (LBM) 2007

Open Access Proceedings

Gene Ontology density estimation and discourse analysis for automatic GeneRiF extraction

Julien Gobeill1,2, Imad Tbahriti1,2, Frédéric Ehrler1, Anaïs Mottaz2, Anne-Lise Veuthey2 and Patrick Ruch1*

Author Affiliations

1 University and Hospitals of Geneva, Geneva, Switzerland

2 Swiss-Prot Research Group, Swiss Institute of Bioinformatics, Geneva, Switzerland

BMC Bioinformatics 2008, 9(Suppl 3):S9  doi:10.1186/1471-2105-9-S3-S9

Published: 11 April 2008



This paper describes and evaluates a sentence selection engine that extracts a GeneRiF (Gene Reference Into Function), as defined in ENTREZ-Gene, from a MEDLINE record. Inputs for this task include both a gene and a pointer to a MEDLINE reference. The proposed approach merges two independent sentence extraction strategies. The first strategy (LASt) uses argumentative features inspired by discourse-analysis models. The second strategy (GOEx) uses an automatic text categorizer to estimate the density of Gene Ontology categories in every sentence, thus providing a full ranking of all candidate GeneRiFs. A combination of the two approaches is proposed, which also aims to reduce the size of the selected segment by filtering out non-content-bearing rhetorical phrases.
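To make the idea of merging two sentence rankings concrete, the following is a minimal sketch and not the authors' implementation. It assumes that an argumentative score per sentence is already available (here passed in as `arg_scores`), approximates Gene Ontology density by simple lookup in a small term set instead of the automatic text categorizer described above, and combines the two with a hypothetical linear weight. The names `go_density`, `rank_sentences`, `arg_scores`, `go_terms`, and `weight` are all illustrative assumptions.

```python
import re


def go_density(sentence, go_terms):
    """Fraction of word tokens in the sentence found in go_terms.

    Assumption: a plain lexicon lookup stands in for the text
    categorizer that the abstract describes.
    """
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in go_terms)
    return hits / len(words)


def rank_sentences(sentences, arg_scores, go_terms, weight=0.5):
    """Rank sentences by a weighted sum of two scores, best first.

    Assumption: linear combination with a hypothetical weight; the
    abstract does not state how the two strategies are merged.
    """
    scored = []
    for index, (sentence, arg) in enumerate(zip(sentences, arg_scores)):
        combined = weight * arg + (1 - weight) * go_density(sentence, go_terms)
        scored.append((combined, index, sentence))
    # Sort by descending score; ties keep the original sentence order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [sentence for _, _, sentence in scored]


if __name__ == "__main__":
    abstract = [
        "We studied the protein.",
        "The protein regulates apoptosis and transcription.",
    ]
    ranking = rank_sentences(abstract, [0.2, 0.8], {"apoptosis", "transcription"})
    print(ranking[0])
```

In this toy setup the top-ranked sentence would be taken as the candidate GeneRiF; a real system would replace both scoring functions with trained components.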


Evaluated on the TREC-2003 Genomics collection for GeneRiF identification, the LASt extraction strategy alone is already competitive (52.78%). Combining it with GOEx clearly improves extraction, achieving a Dice score of over 57% (+10%).
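For readers unfamiliar with the metric, the sketch below computes a classic Dice coefficient between a candidate sentence and a reference GeneRiF over sets of lowercased word tokens. The tokenization and the choice of the set-based variant are assumptions made for illustration; the exact Dice variant used in the evaluation is not stated here. The function names `tokens` and `dice` are hypothetical.

```python
import re


def tokens(text):
    """Set of lowercased alphanumeric tokens (an assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def dice(candidate, reference):
    """Classic Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    a, b = tokens(candidate), tokens(reference)
    if not a and not b:
        # Convention chosen here to avoid division by zero.
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    candidate = "The protein regulates apoptosis"
    reference = "This protein regulates apoptosis in neurons"
    print(round(dice(candidate, reference), 3))
```

A score of 1.0 means the candidate and the reference share exactly the same token set, and 0.0 means they share no tokens.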


Argumentative representation levels and conceptual density estimation using Gene Ontology contents appear complementary for functional annotation in proteomics.