BioNLP Shared Task - The Bacteria Track

Bossy, Robert; Jourde, Julien; Manine, Alain-Pierre; Veber, Philippe; Alphonse, Erick; van de Guchte, Maarten; Bessières, Philippe; Nédellec, Claire

doi:10.1186/1471-2105-13-S11-S3

Volume 13 Supplement 11

Selected articles from the BioNLP Shared Task 2011

Proceedings
Open access
Published: 26 June 2012

BMC Bioinformatics volume 13, Article number: S3 (2012)

Abstract

Background

We present the BioNLP 2011 Shared Task Bacteria Track, the first Information Extraction challenge entirely dedicated to bacteria. It includes three tasks that cover different levels of biological knowledge. The Bacteria Gene Renaming supporting task is aimed at extracting gene renaming and gene name synonymy in PubMed abstracts. The Bacteria Gene Interaction task is a gene/protein interaction extraction task from individual sentences. The interactions have been categorized into ten different sub-types, thus giving a detailed account of genetic regulations at the molecular level. Finally, the Bacteria Biotopes task focuses on the localization and environment of bacteria mentioned in textbook articles.

We describe the process of creation for the three corpora, including document acquisition and manual annotation, as well as the metrics used to evaluate the participants' submissions.

Results

Three teams submitted to the Bacteria Gene Renaming task; the best team achieved an F-score of 87%. For the Bacteria Gene Interaction task, the only participant achieved a global F-score of 77%, although the system's efficiency varies significantly from one sub-type to another. Three teams submitted to the Bacteria Biotopes task with very different approaches; the best team achieved an F-score of 45%. A detailed study of the participating systems' efficiency nevertheless reveals the strengths and weaknesses of each system.

Conclusions

The three tasks of the Bacteria Track offer participants a chance to address a wide range of issues in Information Extraction, including entity recognition, semantic typing and coreference resolution. We found common trends in the most efficient systems: the systematic use of syntactic dependencies and machine learning. Nevertheless, the originality of the Bacteria Biotopes task encouraged the use of interesting novel methods and techniques, such as term compositionality and scopes wider than the sentence.

Background

Motivation and related work

The extraction of molecular events from the scientific literature is the most popular task in Information Extraction (IE) challenges applied to biology, such as in the LLL [1], BioCreative Protein-Protein Interaction Task [2], or BioNLP [3] challenges. Since the BioNLP 2009 shared task [4], this field has evolved from the extraction of a unique binary interaction relation between proteins and/or genes toward a broader acceptation of biological events, including localization and transformation [5].

The study of bacteria has numerous applications for health, food and industry, and overall, they are considered to be organisms of choice for the recent integrative approaches in systems biology because of their relative simplicity and the extent of the current knowledge. However, the current range of available challenges falls far short of reflecting the diversity of the potential applications of text mining to biology. The full understanding of a bacterial cell requires a wide range of levels of knowledge, among which molecular mechanisms are only one aspect. Microbiologists also require information about the cell life cycle, cell structure, the detailed environment of the bacteria, and its phylogenetic position. The Bacteria Track of the BioNLP 2011 Shared Task gathers three Information Extraction tasks targeted at three different levels of knowledge on bacteria. Thus, we have the first set of IE challenges that are fully dedicated to bacteria and that encompass a wide range of knowledge levels.

At the nomenclatural level, the Bacteria Gene Renaming task challenges the participants to extract gene renaming acts and other gene synonymy mentions from PubMed abstracts. At the molecular level, the Bacteria Gene Interaction task is a more "classic" gene and protein interaction extraction task. Finally, we present the Bacteria Biotopes task, which aims at extracting information about bacteria habitats and biotopes, that is, the places where they live.

Bacteria Gene Renaming

Gene renaming is a frequent phenomenon, especially for model bacteria, where there has been little to no effort toward the standardization of the nomenclature and naming conventions are not strictly enforced. Moreover, the history of bacterial gene naming has led to large numbers of homonyms and synonyms. For example, many genes of Bacillus subtilis were renamed in the middle of the 1990s, so that the new names matched those of the Escherichia coli homologs.

Hence, the abundance of gene synonyms that are not morphological variants is high compared to eukaryotes. Synonyms are often missing or erroneous in gene databases. Specifically, databases often omit old gene names that are no longer used in new publications but that are critical for an exhaustive bibliography search. Polysemy makes the situation even worse because old names are frequently reused to denote different genes. A correct and complete gene synonym table is crucial to biology studies, for example, when integrating large-scale experimental data that use distinct nomenclatures. Indeed, this information can save a substantial amount of bibliographic research time. The Rename task is a new task in text mining for biology, which aims at extracting explicit mentions of renaming relations. The motivation of the Rename task is to keep bacteria gene synonym tables up to date. Additionally, it is a critical step in gene name normalization, which is needed for further extraction of biological events such as genic interactions. The goal of the Rename task is illustrated by Figure 1. It consists of predicting renaming relations between text-bound gene names that are given as input. The only type of event is Renaming, for which both arguments are of type Gene. The event is directed, and the former and the new names are distinguished. Genes and proteins were not distinguished because of the high frequency of metonymy in renaming events. In the example of Figure 1, "YtaA", "YvdP" and "YnzH" are the former names of three proteins that were renamed "CotI", "CotQ" and "CotU", respectively.
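
As a rough illustration of the task's output, a Renaming event can be modeled as a directed pair of text-bound gene mentions, from which a synonym table is built. The Python sketch below is not the shared task's file format: the class names (GeneMention, Renaming), the field names (former, new) and the character offsets are assumptions made for the example. Only the three name pairs come from Figure 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneMention:
    """A text-bound gene name (hypothetical structure; offsets are illustrative)."""
    name: str
    start: int
    end: int


@dataclass(frozen=True)
class Renaming:
    """Directed event: the gene `former` was renamed to `new`."""
    former: GeneMention
    new: GeneMention


def synonym_table(events):
    """Map each new gene name to the sorted list of its former names."""
    table = {}
    for event in events:
        table.setdefault(event.new.name, set()).add(event.former.name)
    return {new: sorted(olds) for new, olds in table.items()}


# Name pairs from the Figure 1 example; the offsets below are made up.
pairs = [("YtaA", "CotI"), ("YvdP", "CotQ"), ("YnzH", "CotU")]
events = [
    Renaming(GeneMention(old, 0, len(old)), GeneMention(new, 10, 10 + len(new)))
    for old, new in pairs
]
table = synonym_table(events)
```

Keeping the direction explicit (former versus new) is what allows the table to be keyed by the current name, which is the use case described above for keeping synonym tables up to date.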

Bacteria Gene Interactions

Gene and protein interactions are not formulated in the same way for eukaryotes and prokaryotes. Descriptions of interactions and regulations in bacteria include more knowledge about their molecular actors and mechanisms, compared to the literature on eukaryotes. Typically, in the bacteria literature, genic regulations are more likely to be expressed as the direct binding of the protein, while in the eukaryote literature, non-genic agents related to environmental conditions are much more frequent. The Bacteria Gene Interaction task (GI) is based on [6], which is a semantic re-annotation of the LLL challenge corpus [1], for which the description of the GI events in a fine-grained representation includes the distinction between expression, transcription and other action events, as well as different transcription controls (e.g., regulon membership, promoter binding). The entities are not only protein agents and gene targets but also extend to families, complexes and DNA sites (binding sites, promoters) to better capture the complexity of the regulation at a molecular level. The task consists of relating the entities with the relevant relations.

The goal of the GI task is illustrated by Figure 2. The genes "cotB" and "cotC" are related to their two promoters, which are not named here, by the relation PromoterOf. The protein "GerE" is related to these promoters by the relation BindTo. As a consequence, "GerE" is related to "cotB" and "cotC" by an Interaction relation. According to [5], the need to define specialized relations replacing one unique and general interaction relation was raised in [7] for extracting genic interactions from text. An ontology describes the relations and entities [8], representing a model of gene transcription to which biologists implicitly refer in their publications. Therefore, the ontology is mainly oriented toward the description of a structural model of genes, with molecular mechanisms of their transcription and associated regulations.
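
The way the Figure 2 example composes relations can be sketched in a few lines of Python. This is only an illustration of that single example, not the task's annotation rule or any participant's method. The function name, the argument order of each relation, and the promoter identifiers (the promoters are unnamed in the text) are assumptions.

```python
def derive_interactions(bind_to, promoter_of):
    """Return (agent, gene) pairs implied by agent-binds-site and site-is-promoter-of-gene.

    bind_to:     iterable of (agent, site) pairs      (assumed argument order)
    promoter_of: iterable of (promoter, gene) pairs   (assumed argument order)
    """
    genes_by_promoter = {}
    for promoter, gene in promoter_of:
        genes_by_promoter.setdefault(promoter, set()).add(gene)

    interactions = set()
    for agent, site in bind_to:
        # A binding site that is not a known promoter yields no interaction here.
        for gene in genes_by_promoter.get(site, ()):
            interactions.add((agent, gene))
    return interactions


# Placeholder promoter identifiers: the promoters are not named in the sentence.
promoter_of = [("promoter_cotB", "cotB"), ("promoter_cotC", "cotC")]
bind_to = [("GerE", "promoter_cotB"), ("GerE", "promoter_cotC")]
interactions = derive_interactions(bind_to, promoter_of)
```

With these inputs the derived set contains the two pairs stated in the text: GerE with cotB and GerE with cotC.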

Bacteria Biotopes

The Bacteria Biotope (BB) task consists of extracting bacteria location events from Web pages, in other words, citations of places where a given species lives. It is the first step toward linking information on bacteria to ecological information at the molecular level.

According to NCBI statistics, there are nearly 900 bacteria with complete genomes, which account for more than 87% of the total complete genomes. Consequently, molecular studies in bacteriology are shifting from species-centered to full diversity investigations. The current trend in high-throughput experiments targets diversity-related fields, typically phylogeny or ecology. In this context, adaptation properties, biotopes and biotope properties become critical information. Illustrative questions in the field are as follows:

Are some phylogenetic groups specialized to given biotopes?
What are common metabolic pathways of species that live in given conditions, especially species that survive in extreme conditions?
What are the molecular signaling patterns in host relationships or population relationships (e.g., in biofilms)?

Recent metagenomic experiments produce molecular data that are associated with a habitat rather than a single species. This scenario raises new challenges in computational biology and data integration, such as identifying known and new species that belong to a metagenome. These studies will require not only comprehensive databases that associate bacterial species with their habitats but also a formal description of the habitats for property inference.

The bacteria biotope description is potentially very rich because any physical object, from a cell to a continent, can be a bacterial habitat. However, these relations are much simpler to model than with general formal spatial ontologies. A given place is a bacterial habitat if the bacteria and the habitat are physically in contact, while the relative position of the bacteria and its dissemination are not of specific interest. The information on bacterial habitats and properties of these habitats is very abundant in the literature, especially in the Systematics literature (e.g., International Journal of Systematic and Evolutionary Microbiology); however, it is rarely available in a structured way [9, 10]. The NCBI GenBank (http://www.ncbi.nlm.nih.gov/) nucleotide isolation source field and the JGI Genome OnLine Database [11] isolation site field are incomplete with respect to microbial diversity and are expressed in natural language. The two critical missing steps in terms of biotope knowledge modeling are (1) the automatic population of databases with organism/location pairs that are extracted from text, and (2) the normalization of the habitat name with respect to the biotope ontologies. The BB task aims mainly at solving the first information extraction issue. The second classification issue is handled through the categorization of locations into eight broad types.

From a linguistic point of view, the BB task differs from other IE molecular biology tasks, while it raises some issues that are common to biomedicine and some of the more general IE tasks. The documents are scientific Web pages intended for non-experts, such as encyclopedia entries. Documents are structured as encyclopedia pages, with the main focus on a single species or a few species of the same genus or family. The information is dense compared to scientific papers, and the frequency of anaphora and coreferences is unusually high. Location entities can be denoted by named entities, especially geographic locations and bacteria host species names. However, other locations are denoted as noun phrases or adjectives with no clear boundaries.

Methods

Bacteria Gene Renaming

Corpus annotation methodology

The Rename task corpus is a set of 1,644 PubMed references of bacterial genetic and genomic studies, including the title and abstract. Figure 3 presents the most common forms of renaming.

The main intent during the corpus creation process was to enrich the corpus in mentions of gene renaming or gene synonymy; indeed, these mentions are extremely scarce. A first set of 23,000 documents was retrieved by identifying the presence of the bacterium Bacillus subtilis in the text and/or in the MeSH terms. B. subtilis documents are especially rich in renaming mentions.

As a second filtering step, we selected documents using two distinct criteria:

1. mentions of at least two gene synonyms, as recorded in the fusion of seven B. subtilis gene nomenclatures, leading to a set of 703 documents.
2. renaming expressions from a list that we manually designed and tested (e.g., "rename", "also known as"). Unexpectedly, these documents contained very few gene renamings, but instead contained renamings concerning other types of biological entities (e.g., protein domains, molecules, cellular ultrastructures). This criterion allowed us to add 941 documents.
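
The two selection criteria above can be sketched as a simple document filter. The Python below is a minimal illustration under stated assumptions and is not the authors' pipeline: the synonym groups are a tiny hypothetical stand-in for the fused nomenclatures, the expression list contains only the two examples quoted above, and reading criterion 1 as "two names of the same gene" is an interpretation.

```python
import re

# Hypothetical stand-ins: the fused nomenclatures and the full expression
# list are not reproduced in the text.
SYNONYM_GROUPS = [{"ytaA", "cotI"}, {"yvdP", "cotQ"}]
RENAMING_EXPRESSIONS = ["rename", "also known as"]


def tokens(text):
    """Split text into alphanumeric tokens (case is preserved)."""
    return set(re.findall(r"[A-Za-z0-9]+", text))


def mentions_synonym_pair(text):
    """Criterion 1 (as interpreted here): two synonyms of one gene appear in the text."""
    words = tokens(text)
    return any(len(group & words) >= 2 for group in SYNONYM_GROUPS)


def has_renaming_expression(text):
    """Criterion 2: the text contains one of the renaming expressions."""
    lowered = text.lower()
    return any(expression in lowered for expression in RENAMING_EXPRESSIONS)


def select(documents):
    """Keep documents that satisfy either criterion."""
    return [
        doc for doc in documents
        if mentions_synonym_pair(doc) or has_renaming_expression(doc)
    ]


docs = [
    "The ytaA gene, renamed cotI, encodes a spore coat protein.",
    "This domain is also known as the HTH motif.",
    "Sporulation was studied in B. subtilis.",
]
selected = select(docs)
```

The second example document shows the effect reported above: an expression-based match that concerns a protein domain rather than a gene.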

Approximately 70% of the documents (1,146) were kept in the training data set. The remainder were split into the development and test sets, containing 246 and 252 documents, respectively. Table 1 gives the distribution of genes and renaming relations per corpus. Gene names were automatically annotated in the documents with the nomenclature of B. subtilis. Gene names involved in renaming acts were manually curated. Among the 21,878 genes mentioned in the three corpora, 680 unique names are involved in renaming relations, representing 891 occurrences of genes.
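
The corpus figures quoted in this section can be cross-checked with a few lines of arithmetic. The Python below uses only numbers stated in the text; the variable names are illustrative.

```python
# Document counts stated in the text.
train, dev, test = 1146, 246, 252
by_synonym_criterion, by_expression_criterion = 703, 941

total_from_split = train + dev + test
total_from_criteria = by_synonym_criterion + by_expression_criterion

# Share of documents kept for training, in percent.
train_share = 100 * train / total_from_split

# Share of gene occurrences that take part in a renaming relation, in percent.
renaming_share = 100 * 891 / 21878

print(total_from_split, total_from_criteria)
print(round(train_share, 1), round(renaming_share, 1))
```

Both totals equal the 1,644 references of the corpus, the training share is consistent with "approximately 70%", and only about 4% of gene occurrences are involved in a renaming relation, which matches the scarcity noted earlier.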

Table 1 Rename corpus size.
