Species identification for gene name normalization

Solt, Illés; Tikk, Domonkos; Leser, Ulf

doi:10.1186/1471-2105-11-S5-P5

Volume 11 Supplement 5

Workshop on Advances in Bio Text Mining

Poster presentation
Open access
Published: 06 October 2010

Illés Solt^1,2, Domonkos Tikk^1,2 & Ulf Leser^1

BMC Bioinformatics volume 11, Article number: P5 (2010)

Background

Protein interaction networks are expensive to construct experimentally. Therefore, researchers usually turn to the literature or domain-specific databases to obtain knowledge on currently known interactions. Yet manually collecting knowledge from scientific papers is labor-intensive and should be automated to the extent possible. An important step toward this is identifying gene and protein names (termed entities) in text. Once identified, gene names must be mapped to database identifiers to connect them to structured knowledge. One particular problem in this step is homonymy, i.e., identical names referring to different genes in different species.
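
The homonymy problem can be illustrated with a minimal sketch. The lexicon, gene names, and identifiers below are made-up placeholders rather than entries from any real database; the point is only that a name alone may map to several identifiers, while a name plus a species label maps to at most one.

```python
# Toy lexicon: (lower-cased gene name, species) -> database identifier.
# All names and identifiers are hypothetical placeholders.
LEXICON = {
    ("abc1", "human"): "ID-H-001",
    ("abc1", "mouse"): "ID-M-001",
    ("xyz2", "human"): "ID-H-002",
}


def candidate_ids(name):
    """Return every identifier that shares this name, across all species."""
    key = name.lower()
    return sorted(ident for (n, _), ident in LEXICON.items() if n == key)


def normalize(name, species):
    """Return the identifier for this name in the given species, or None."""
    return LEXICON.get((name.lower(), species))
```

Here `candidate_ids("ABC1")` returns two candidates, whereas `normalize("ABC1", "mouse")` resolves to a single identifier, which is why a document-level species label helps normalization.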

Methods

We present different approaches for assigning species labels to MEDLINE abstracts. We use (1) as a baseline, the most frequent species MeSH term of the corresponding journal; (2) the prediction of a binary classifier (SVM) for each species; (3) species names found by the tools Ali Baba [1] or LINNAEUS [2]; (4) the species of a normalized protein mention found by GNAT [3]. For evaluation, we use two sources of gold-standard document-level annotations: the MeSH terms from MEDLINE, and the species from UniProt and the E. coli-specific RegulonDB [4], linked via protein-MEDLINE references.
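
As an illustration of the journal baseline (1), the following is a minimal sketch and not the authors' implementation. It assumes each training document is available as a pair of a journal name and a set of species labels (for example, derived from MeSH terms); the function names and the input layout are hypothetical.

```python
from collections import Counter, defaultdict


def train_journal_baseline(docs):
    """Map each journal to its most frequent species label.

    docs: iterable of (journal, set_of_species_labels) pairs.
    Journals whose documents carry no species label are skipped.
    """
    counts = defaultdict(Counter)
    for journal, species_labels in docs:
        counts[journal].update(species_labels)
    return {
        journal: counter.most_common(1)[0][0]
        for journal, counter in counts.items()
        if counter
    }


def predict_species(model, journal):
    """Return the baseline species for a journal, or None if unseen."""
    return model.get(journal)
```

The baseline ignores the abstract text entirely, so any method that reads the text should be expected to improve on it for journals that cover several organisms.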

Results

Measurements on a random set of 200,000 abstracts from MEDLINE are summarized in Table 1. For MeSH term prediction, the text-based methods (Ali Baba, LINNAEUS, GNAT) show stable performance across species, whereas the classification methods, which rely on training data, perform worse for species with lower prior probability. For the most frequent species, human, the bag-of-words SVM overcomes the lack of explicit species mentions by learning other clues. With UniProt as the gold standard, learning methods achieve substantially higher recall, indicating that molecular biology papers mention their focus organisms more explicitly. There is considerable disagreement between the gold-standard databases; e.g., only 85.7% of the papers referenced from a comprehensive E. coli-specific database are annotated as E. coli by MeSH. Possible reasons include incompleteness of MeSH annotations or the consideration of orthologs in RegulonDB.

Table 1 Comparison of methods for document-level species annotation
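
The kind of document-level comparison reported above can be sketched as follows. This is an illustrative sketch under assumptions, not the evaluation code used for Table 1: predictions and gold annotations are assumed to be dictionaries from document identifier to a set of species labels, and both function names are hypothetical. The second function computes the share of documents carrying a given label, which is the type of figure behind the E. coli agreement rate.

```python
def precision_recall(predicted, gold, species):
    """Document-level precision and recall for one species.

    predicted, gold: dict mapping doc_id -> set of species labels.
    Only documents present in the gold standard are scored.
    """
    tp = fp = fn = 0
    for doc_id, gold_labels in gold.items():
        is_pred = species in predicted.get(doc_id, set())
        is_gold = species in gold_labels
        if is_pred and is_gold:
            tp += 1
        elif is_pred:
            fp += 1
        elif is_gold:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def label_coverage(doc_ids, annotations, species):
    """Fraction of the given documents annotated with the species."""
    doc_ids = list(doc_ids)
    if not doc_ids:
        return 0.0
    hits = sum(1 for d in doc_ids if species in annotations.get(d, set()))
    return hits / len(doc_ids)
```

Returning 0.0 for an empty denominator is a design choice made here for simplicity; reporting such cases as undefined would be equally defensible.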

Conclusion

We conclude that there is no one-size-fits-all method for identifying species in abstracts. For less frequent species, methods that identify direct species mentions work best. The advantage of using indirect clues could only be realized for the most frequent species, human, suggesting that machine learning methods should be applied after better balancing the training data. We also showed that using MeSH term queries to filter papers considerably limits recall.

References

1. Plake C, Schiemann T, Pankalla M, Hakenberg J, Leser U: AliBaba: PubMed as a graph. Bioinformatics 2006, 22(19):2444–2445. 10.1093/bioinformatics/btl408
2. Gerner M, Nenadic G, Bergman C: LINNAEUS: A species name identification system for biomedical literature. BMC Bioinformatics 2010, 11: 85. 10.1186/1471-2105-11-85
3. Hakenberg J, Plake C, Leaman R, Schroeder M, Gonzalez G: Inter-species normalization of gene mentions with GNAT. Bioinformatics 2008, 24(16):126–132. 10.1093/bioinformatics/btn299
4. Salgado H, Santos-Zavaleta A, Gama-Castro S, Peralta-Gil M, Penaloza-Spinola M, Martinez-Antonio A, Karp P, Collado-Vides J: The comprehensive updated regulatory network of Escherichia coli K-12. BMC Bioinformatics 2006, 7: 5. 10.1186/1471-2105-7-5

Acknowledgements

Domonkos Tikk was supported by the Alexander-von-Humboldt Foundation.

Author information

Authors and Affiliations

Knowledge Management in Bioinformatics, Institute for Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099, Berlin, Germany
Illés Solt, Domonkos Tikk & Ulf Leser
Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, H-1117, Budapest, Magyar Tudósok krt 2., Hungary
Illés Solt & Domonkos Tikk

Corresponding author

Correspondence to Illés Solt.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Solt, I., Tikk, D. & Leser, U. Species identification for gene name normalization. BMC Bioinformatics 11 (Suppl 5), P5 (2010). https://doi.org/10.1186/1471-2105-11-S5-P5
