Open Access Research article

Improving protein coreference resolution by simple semantic classification

Ngan Nguyen1*, Jin-Dong Kim2*, Makoto Miwa3, Takuya Matsuzaki1 and Junichi Tsujii4

Author affiliations

1 National Institute of Informatics, Hitotsubashi 2-1-2, Chiyoda-ku, Tokyo, Japan

2 Database Center for Life Science, Yayoi 2-11-16, Bunkyo-ku, Tokyo, Japan

3 The National Centre for Text Mining, Manchester Interdisciplinary Biocentre, University of Manchester, 131 Princess Street, Manchester, M1 7DN, UK

4 Microsoft Research Asia, 5 Dan Ling Street, Haidian District, Beijing, China

Citation and License

BMC Bioinformatics 2012, 13:304  doi:10.1186/1471-2105-13-304

Published: 17 November 2012



Recent research has shown that major difficulties in event extraction for the biomedical domain are traceable to coreference, so coreference resolution is expected to improve event extraction. To address coreference resolution in molecular biology literature, the Protein Coreference (COREF) task was organized as a supporting task in the BioNLP Shared Task (BioNLP-ST, hereafter) 2011. However, the shared task results indicated that transferring coreference resolution methods developed for other domains to the biological domain is not straightforward, because the coreference phenomena differ between domains.


We analyzed the contribution of domain-specific information, including information that indicates the protein type, in a rule-based protein coreference resolution system. In particular, the domain-specific information is encoded into semantic classification modules whose output is used in different components of coreference resolution. We compared our system with the top four systems in the BioNLP-ST 2011; surprisingly, we found that the minimal configuration outperformed the best system in the BioNLP-ST 2011. Analysis of the experimental results revealed that semantic classification using protein information increased the F-score by 2.3% on the test data and 4.0% on the development data.
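As a rough illustration of how the output of a semantic classifier can feed a rule-based resolver, the following minimal Python sketch filters antecedent candidates by a protein/non-protein decision before applying a simple closest-first rule. It is not the system described in this article: the `Mention` type, the `PROTEIN_HEAD_WORDS` lexicon, the `contains_protein_name` flag, and the closest-first rule are all hypothetical and chosen only to keep the example small.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical toy lexicon of head words that suggest a protein-denoting
# noun phrase; a real system would rely on a richer resource.
PROTEIN_HEAD_WORDS = {"protein", "proteins", "gene", "genes",
                      "factor", "factors", "receptor", "receptors"}


@dataclass
class Mention:
    text: str
    start: int                           # character offset, used for ordering
    contains_protein_name: bool = False  # assumed to come from upstream annotation


def is_protein_class(mention: Mention) -> bool:
    """Toy semantic classifier: a mention counts as protein-denoting if it
    contains an annotated protein name or its last token is in the lexicon."""
    if mention.contains_protein_name:
        return True
    tokens = mention.text.lower().split()
    return bool(tokens) and tokens[-1] in PROTEIN_HEAD_WORDS


def resolve(anaphor: Mention, candidates: List[Mention]) -> Optional[Mention]:
    """Return the closest preceding candidate that passes the semantic filter,
    or None if no candidate qualifies."""
    preceding = [c for c in candidates
                 if c.start < anaphor.start and is_protein_class(c)]
    return max(preceding, key=lambda c: c.start, default=None)


# Usage: the nearer candidate "the cell line" is rejected by the filter,
# so the pronoun is linked to the protein mention instead.
il2 = Mention("IL-2", start=0, contains_protein_name=True)
cell_line = Mention("the cell line", start=20)
pronoun = Mention("it", start=40)
antecedent = resolve(pronoun, [il2, cell_line])
```

In this sketch the classifier is used only to prune candidates; the same decision could equally be reused in other components, such as deciding which anaphors to resolve in the first place.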


The use of domain-specific information in semantic classification is important for effective coreference resolution. Since it is difficult to transfer domain-specific information across different domains, we need to continue to seek methods for utilizing such information in coreference resolution.