This article is part of the supplement: The Third BioCreative – Critical Assessment of Information Extraction in Biology Challenge

The gene normalization task in BioCreative III

Zhiyong Lu1*, Hung-Yu Kao2, Chih-Hsuan Wei2, Minlie Huang3, Jingchen Liu3, Cheng-Ju Kuo4, Chun-Nan Hsu4,5, Richard Tzong-Han Tsai6, Hong-Jie Dai7,8, Naoaki Okazaki9, Han-Cheol Cho10, Martin Gerner11, Illes Solt12, Shashank Agarwal13, Feifan Liu13, Dina Vishnyakova14, Patrick Ruch15, Martin Romacker16, Fabio Rinaldi17, Sanmitra Bhattacharya18, Padmini Srinivasan18, Hongfang Liu19, Manabu Torii20, Sergio Matos21, David Campos21, Karin Verspoor22, Kevin M Livingston22 and W John Wilbur1*

Author Affiliations

1 National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, Maryland 20894, USA

2 Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C

3 Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China

4 Institute of Information Science, Academia Sinica, Taipei 115, Taiwan

5 Information Science Institute, University of Southern California, Marina del Rey, California, USA

6 Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan, R.O.C

7 Department of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan, R.O.C

8 Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C

9 Interfaculty Initiative in Information Studies, University of Tokyo, Japan

10 Graduate School of Information Science and Technology, University of Tokyo, Japan

11 Faculty of Life Sciences, University of Manchester, Manchester, M13 9PT, UK

12 Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, 1117 Budapest, Hungary

13 Medical Informatics, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin, USA

14 BiTeM Group, Division of Medical Information Sciences, University of Geneva, Switzerland

15 BiTeM Group, Information Science Department, University of Applied Science, Geneva, Switzerland

16 NITAS/TMS, Text Mining Services, Novartis AG, Switzerland

17 Institute of Computational Linguistics, University of Zurich, Zurich, Switzerland

18 Department of Computer Science, The University of Iowa, Iowa City, Iowa 52242, USA

19 Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, MN 55905 USA

20 Lab of Text Intelligence in Biomedicine, Georgetown University Medical Center, 4000 Reservoir Rd., NW, Washington, DC 20057 USA

21 DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal

22 Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA

BMC Bioinformatics 2011, 12(Suppl 8):S2  doi:10.1186/1471-2105-12-S8-S2

Published: 3 October 2011



We report the Gene Normalization (GN) challenge in BioCreative III, in which participating teams were asked to return a ranked list of identifiers for the genes detected in full-text articles. For training, 32 fully annotated and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Because of the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an approach based on the Expectation Maximization (EM) algorithm to choose a small number of test articles for manual annotation: those most capable of differentiating team performance. The same algorithm was subsequently used to infer ground truth based solely on team submissions. We report team performance on both the gold standard and the inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k).


We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 manually annotated test articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20). A higher best TAP-k score of 0.4916 (for k=5, 10, and 20) was observed when teams were evaluated using the inferred ground truth over the full test set. When team results were combined using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing relative improvements of 12.4%, 21.8%, and 26.6%, respectively, over the best team results.
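The improvement percentages quoted above are relative gains of the composite system over the best single-team TAP-k score at each k. As a minimal sketch for checking that arithmetic, the Python snippet below recomputes them from the scores given in the text; the names `best_team`, `composite`, and `relative_improvement` are illustrative assumptions and do not come from the original evaluation software.

```python
# TAP-k scores on the gold standard, as quoted in the text, keyed by k.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(new: float, old: float) -> float:
    """Return the relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Relative gain of the composite system over the best team, one value per k.
improvements = {
    k: round(relative_improvement(composite[k], best_team[k]), 1)
    for k in best_team
}
print(improvements)  # {5: 12.4, 10: 21.8, 20: 26.6}
```

The recomputed values agree with the 12.4%, 21.8%, and 26.6% reported for k=5, 10, and 20.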


By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past, and it presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams against the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth, we report measures of comparative performance between teams. Finally, by comparing team rankings on the gold standard with those on the inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.