Open Access Research article

Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text

Antonio Jimeno Yepes1,2*, Élise Prieur-Gaston3 and Aurélie Névéol4,5*

Author Affiliations

1 Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, USA

2 NICTA Victoria Research Lab, Melbourne, VIC, 3010, Australia

3 Université de Rouen, LITIS EA-4108, 1 rue Thomas Becket, Mont Saint-Aignan, F-76821, France

4 National Center for Biotechnology Information, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, USA

5 LIMSI-CNRS, rue John von Neumann, Orsay, F-91400, France

BMC Bioinformatics 2013, 14:146  doi:10.1186/1471-2105-14-146

Published: 30 April 2013

Abstract

Background

Most institutional and research information in the biomedical domain is available as English text. Even in countries where English is the primary language, such as the United States, language can be a barrier to accessing biomedical information for non-native speakers. Recent progress in machine translation suggests that this technique could help make English texts accessible to speakers of other languages. However, the lack of adequate specialized corpora needed to train statistical models currently limits the quality of automatic translations in the biomedical domain.

Results

We show how a large parallel corpus can be obtained automatically for the biomedical domain using the MEDLINE database. The corpus generated in this work comprises article titles obtained from MEDLINE and abstract text automatically retrieved from journal websites, which substantially extends the corpora used in previous work. After assessing the quality of the corpus for two language pairs (English/French and English/Spanish), we use the Moses package to train a statistical machine translation model that outperforms previous models for automatic translation of biomedical text.
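
As a rough illustration of the title-pairing step, the sketch below builds line-aligned English/foreign title pairs of the kind that statistical translation toolkits such as Moses take as training input. It is not the authors' pipeline: the record fields (`language`, `english_title`, `vernacular_title`) and the function names are hypothetical, and the cleaning rule only assumes that MEDLINE wraps translated English titles in square brackets.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Translated English titles in MEDLINE look like "[Some title]."
_BRACKETED = re.compile(r"^\[(.*)\]\.?$")


def clean_title(title: str) -> str:
    """Collapse whitespace, drop enclosing brackets and the final period."""
    title = " ".join(title.split())
    match = _BRACKETED.match(title)
    if match:
        title = match.group(1)
    return title.rstrip(".").strip()


def build_pairs(records: Iterable[Dict[str, str]], lang: str) -> List[Tuple[str, str]]:
    """Return (English, foreign) title pairs for one language code.

    Each record is a plain dict with the hypothetical keys
    'language', 'english_title' and 'vernacular_title'.
    Records missing either title are skipped.
    """
    pairs = []
    for rec in records:
        if rec.get("language") != lang:
            continue
        english = clean_title(rec.get("english_title", ""))
        foreign = clean_title(rec.get("vernacular_title", ""))
        if english and foreign:
            pairs.append((english, foreign))
    return pairs


def write_parallel(pairs: List[Tuple[str, str]], en_path: str, foreign_path: str) -> None:
    """Write two files in which line i of one is the translation of line i of the other."""
    with open(en_path, "w", encoding="utf-8") as en_file, \
         open(foreign_path, "w", encoding="utf-8") as fo_file:
        for english, foreign in pairs:
            en_file.write(english + "\n")
            fo_file.write(foreign + "\n")


# Illustrative records only; not taken from the corpus described here.
records = [
    {"language": "fre",
     "english_title": "[Treatment of asthma in children].",
     "vernacular_title": "Traitement de l'asthme chez l'enfant."},
    {"language": "spa",
     "english_title": "[Diabetes and pregnancy].",
     "vernacular_title": "Diabetes y embarazo."},
    {"language": "fre",
     "english_title": "[Title without original].",
     "vernacular_title": ""},
]

french_pairs = build_pairs(records, "fre")
```

Keeping the two output files strictly line-aligned is the design point: any record lacking one side is dropped rather than written as an empty line, so the alignment is never broken.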

Conclusions

We have built translation data sets in the biomedical domain that can easily be extended to other languages available in MEDLINE. These sets can be used successfully to train statistical machine translation models. While further progress could be made by incorporating out-of-domain corpora and domain-specific lexicons, we believe that this work improves the automatic translation of biomedical texts.

Keywords:
Multilingual corpus generation; Statistical machine translation; Biomedical domain