This article is part of the supplement: Workshop on Advances in Bio Text Mining

Open Access Poster presentation

Species identification for gene name normalization

Illés Solt1,2*, Domonkos Tikk1,2 and Ulf Leser1

Author Affiliations

1 Knowledge Management in Bioinformatics, Institute for Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany

2 Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, H-1117 Budapest, Magyar Tudósok krt 2., Hungary

BMC Bioinformatics 2010, 11(Suppl 5):P5  doi:10.1186/1471-2105-11-S5-P5

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/11/S5/P5


Published: 6 October 2010

© 2010 Solt et al; licensee BioMed Central Ltd.

Background

Protein interaction networks are expensive to construct experimentally. Researchers therefore usually turn to the literature or to domain-specific databases for knowledge of currently known interactions. Manually collecting this knowledge from scientific papers is labor intensive, however, and should be automated to the extent possible. An important step toward this is identifying gene and protein names (termed entities). Once identified, gene names must be mapped to database identifiers to connect them to structured knowledge. One particular problem in this step is homonymy, i.e., identical names referring to different genes in different species.

Methods

We present different approaches for assigning species labels to MEDLINE abstracts. We use (1) as a baseline, the most frequent species MeSH term of the corresponding journal, with each journal represented by its MeSH terms; (2) the prediction of a binary classifier (SVM) for each species; (3) species names found by the tools Ali Baba [1] or LINNAEUS [2]; and (4) the species of a normalized protein mention found by GNAT [3]. For evaluation, we use two sources of gold standard document-level annotations: the MeSH terms from MEDLINE, and the species from UniProt and the E. coli-specific RegulonDB [4], linked via protein-MEDLINE references.
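The journal-based baseline (1) can be illustrated with a minimal sketch. It assumes that training data is available as (journal, species MeSH terms) pairs; the function names and the toy records are hypothetical and are not taken from the poster.

```python
from collections import Counter, defaultdict


def train_journal_baseline(training_docs):
    """Map each journal to the species label seen most often among its abstracts.

    training_docs: iterable of (journal, set_of_species_labels) pairs.
    This input layout is an assumption made for illustration.
    """
    counts = defaultdict(Counter)
    for journal, species_labels in training_docs:
        counts[journal].update(species_labels)
    # Keep only journals with at least one species label; pick the top label.
    return {
        journal: counter.most_common(1)[0][0]
        for journal, counter in counts.items()
        if counter
    }


def predict_species(model, journal, default=None):
    """Predict the journal's most frequent species, or `default` if unseen."""
    return model.get(journal, default)


# Toy records, invented for illustration only.
training_docs = [
    ("J Bacteriol", {"Escherichia coli"}),
    ("J Bacteriol", {"Escherichia coli", "Humans"}),
    ("J Bacteriol", {"Bacillus subtilis"}),
    ("Blood", {"Humans"}),
    ("Blood", {"Humans", "Mice"}),
]
model = train_journal_baseline(training_docs)
print(predict_species(model, "J Bacteriol"))  # Escherichia coli
print(predict_species(model, "Blood"))        # Humans
```

Because such a baseline ignores the abstract text entirely, it can only be as good as the journal's dominant organism, which is why it serves as a lower reference point for the text-based and learning-based approaches.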

Results

Measurements on a random set of 200k abstracts from MEDLINE are summarized in Table 1. For MeSH term prediction, the text-based methods (Ali Baba, LINNAEUS, GNAT) show stable performance across species, while the classification methods, as they rely on training data, suffer for species with lower prior probability. For the most frequent species, human, the bag-of-words SVM overcomes the difficulty of missing explicit species mentions by learning other clues. Using UniProt as gold standard, learning methods produce substantially higher recall, indicating that molecular biology papers mention their focus organisms more explicitly. There is considerable disagreement between gold standard databases, e.g., only 85.7% of the papers referenced from a comprehensive E. coli-specific database are annotated as E. coli by MeSH. Possible reasons include incompleteness of MeSH annotations or consideration of orthologs in RegulonDB.

Table 1. Comparison of methods for document-level species annotation
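As a rough illustration of how document-level species annotations can be scored, the sketch below computes per-species precision and recall against a gold standard. The data layout (document ID mapped to a set of species labels), the function name, and the toy values are assumptions for illustration; they are not the evaluation code or data behind Table 1.

```python
def precision_recall(predicted, gold, species):
    """Document-level precision and recall for a single species.

    predicted, gold: dicts mapping a document ID to a set of species labels.
    Returns (precision, recall); a ratio with a zero denominator is
    reported as 0.0.
    """
    docs = set(predicted) | set(gold)
    tp = fp = fn = 0
    for doc in docs:
        in_pred = species in predicted.get(doc, set())
        in_gold = species in gold.get(doc, set())
        if in_pred and in_gold:
            tp += 1
        elif in_pred:
            fp += 1
        elif in_gold:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy annotations with hypothetical document IDs.
gold = {
    "d1": {"E. coli"},
    "d2": {"E. coli"},
    "d3": {"human"},
    "d4": {"E. coli", "human"},
}
predicted = {
    "d1": {"E. coli"},
    "d2": {"human"},
    "d3": {"human"},
    "d4": {"E. coli"},
}
print(precision_recall(predicted, gold, "E. coli"))  # precision 1.0, recall 2/3
print(precision_recall(predicted, gold, "human"))    # (0.5, 0.5)
```

The same function can compare two gold standards with each other: passing one source as `predicted` and the other as `gold` yields an agreement figure of the kind quoted above for MeSH versus the E. coli-specific database.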

Conclusion

We conclude that there is no one-size-fits-all method for identifying species in abstracts. For less frequent species, methods that identify direct species mentions work best. The advantage of using indirect clues could only be realized for the most frequent species, human, suggesting that machine learning methods should be applied after better balancing the training data. We also showed that using MeSH term queries to filter papers considerably limits recall.

Acknowledgements

Domonkos Tikk was supported by the Alexander-von-Humboldt Foundation.

References

  1. Plake C, Schiemann T, Pankalla M, Hakenberg J, Leser U: AliBaba: PubMed as a graph. Bioinformatics 2006, 22(19):2444-2445.

  2. Gerner M, Nenadic G, Bergman C: LINNAEUS: A species name identification system for biomedical literature. BMC Bioinformatics 2010, 11:85.

  3. Hakenberg J, Plake C, Leaman R, Schroeder M, Gonzalez G: Inter-species normalization of gene mentions with GNAT. Bioinformatics 2008, 24(16):126-132.

  4. Salgado H, Santos-Zavaleta A, Gama-Castro S, Peralta-Gil M, Penaloza-Spinola M, Martinez-Antonio A, Karp P, Collado-Vides J: The comprehensive updated regulatory network of Escherichia coli K-12. BMC Bioinformatics 2006, 7:5.