Methodology article (Open Access)

A method for automatically extracting infectious disease-related primers and probes from the literature

Miguel García-Remesal1,2*, Alejandro Cuevas2, Victoria López-Alonso3, Guillermo López-Campos3, Guillermo de la Calle2, Diana de la Iglesia2, David Pérez-Rey1,2, José Crespo2,4, Fernando Martín-Sánchez3 and Víctor Maojo1,2

Author Affiliations

1 Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid. Campus de Montegancedo S/N, 28660 Boadilla del Monte, Madrid, Spain

2 Biomedical Informatics Group, Facultad de Informática, Universidad Politécnica de Madrid. Campus de Montegancedo S/N, 28660 Boadilla del Monte, Madrid, Spain

3 Bioinformatics Unit, Institute of Health Carlos III, Carretera de Majadahonda a Pozuelo Km. 2, 28220 Majadahonda, Madrid, Spain

4 Departamento de Lenguajes y Sistemas Informáticos, Facultad de Informática, Universidad Politécnica de Madrid. Campus de Montegancedo S/N, 28660 Boadilla del Monte, Madrid, Spain

BMC Bioinformatics 2010, 11:410  doi:10.1186/1471-2105-11-410

Published: 3 August 2010

Abstract

Background

Primer and probe sequences are the main components of nucleic acid-based detection systems. Biologists use primers and probes for a variety of tasks, some of them related to diagnosing infectious diseases and prescribing treatments for them. The biological literature is the main source of empirically validated primer and probe sequences, so it is increasingly important for researchers to be able to locate this information efficiently. In this paper, we present a four-phase method for extracting and annotating primer/probe sequences from the literature. The phases are: (1) convert each document into a tree of paper sections, (2) detect candidate sequences using a set of finite state machine-based recognizers, (3) refine problematic sequences using a rule-based expert system, and (4) annotate the extracted sequences with their related organism/gene information.
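To make phase (2) more concrete, the following is a minimal sketch of a finite state machine that scans text for candidate nucleotide sequences. It is an illustration only and not the recognizers used in the paper: the state names, the accepted alphabet (uppercase A, C, G, T, U), the `min_length` threshold of 15 and the function name `find_candidate_sequences` are all assumptions made for this example.

```python
from enum import Enum, auto

# Assumption: only unambiguous, uppercase bases count as sequence characters.
NUCLEOTIDES = set("ACGTU")


class State(Enum):
    OUTSIDE = auto()      # between tokens
    IN_SEQUENCE = auto()  # reading a run of nucleotide letters
    SKIP_WORD = auto()    # inside a token that is not a sequence


def find_candidate_sequences(text, min_length=15):
    """Return runs of nucleotide letters that form whole tokens.

    A run is kept only if it is at least `min_length` characters long
    (hypothetical threshold) and is not part of a longer word or number.
    """
    candidates = []
    state = State.OUTSIDE
    buffer = []
    for ch in text + " ":  # trailing space flushes a sequence at end of text
        if state is State.OUTSIDE:
            if ch in NUCLEOTIDES:
                buffer = [ch]
                state = State.IN_SEQUENCE
            elif ch.isalnum():
                state = State.SKIP_WORD
        elif state is State.IN_SEQUENCE:
            if ch in NUCLEOTIDES:
                buffer.append(ch)
            elif ch.isalnum():
                # The token contains other letters or digits: not a sequence.
                buffer = []
                state = State.SKIP_WORD
            else:
                # A delimiter ends the token; keep it if it is long enough.
                if len(buffer) >= min_length:
                    candidates.append("".join(buffer))
                buffer = []
                state = State.OUTSIDE
        else:  # State.SKIP_WORD
            if not ch.isalnum():
                state = State.OUTSIDE
    return candidates


if __name__ == "__main__":
    sample = "The forward primer 5'-ATGCGTACGTTAGCCTAGGA-3' targets the CAT gene."
    print(find_candidate_sequences(sample))
```

In this sketch, short uppercase tokens such as gene symbols are rejected by the length threshold, and words that merely start with a nucleotide letter are rejected as soon as a non-nucleotide character appears. A realistic recognizer would also need to handle sequences split by spaces or line breaks and ambiguity codes, which this example deliberately leaves out.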

Results

We evaluated our approach on a test set of 297 manuscripts. The extracted sequences and their organism/gene annotations were manually evaluated by a panel of molecular biologists. The results show that our approach is suitable for automatically extracting DNA sequences, achieving a precision of 97.98% and a recall of 95.77%. In addition, 76.66% of the detected sequences were correctly annotated with their organism name. Among the sequences assigned a correct organism name, the system also provided correct gene-related information for 46.18%.
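For readers less familiar with these metrics, the sketch below shows how precision and recall are conventionally computed, and how the two annotation rates combine. The counts passed to `precision` and `recall` are hypothetical and are not taken from the paper, which reports only the resulting percentages; the two annotation rates are the ones quoted above.

```python
def precision(true_positives, false_positives):
    """Fraction of extracted sequences that are genuine primers/probes."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Fraction of genuine primers/probes that were extracted."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical counts, chosen only to illustrate the formulas.
p = precision(true_positives=90, false_positives=10)
r = recall(true_positives=90, false_negatives=30)
print(f"precision={p:.2%} recall={r:.2%}")

# Rates reported in the abstract. The gene rate is conditional on the
# organism name being correct, so the two rates multiply.
organism_correct = 0.7666
gene_correct_given_organism = 0.4618
both_correct = organism_correct * gene_correct_given_organism
print(f"organism and gene both correct: {both_correct:.2%}")
```

Because the gene figure is measured only over sequences with a correct organism name, the share of all detected sequences carrying both a correct organism and correct gene information is roughly 35%, noticeably lower than either figure read on its own.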

Conclusions

We believe that the proposed method can facilitate routine tasks for biomedical researchers who use molecular methods to diagnose infectious diseases and prescribe treatments for them. In addition, the method can be extended to detect and extract other biological sequences from the literature. The extracted information can also be used to update existing primer/probe databases or to create new databases from scratch.