This article is part of the supplement: The International Conference on Intelligent Biology and Medicine (ICIBM) – Genomics

Open Access Research

Identifying the status of genetic lesions in cancer clinical trial documents using machine learning

Yonghui Wu1, Mia A Levy1,2,3, Christine M Micheel3, Paul Yeh3, Buzhou Tang1, Michael J Cantrell3, Stacy M Cooreman3 and Hua Xu1*

Author affiliations

1 Department of Biomedical Informatics, Vanderbilt University, School of Medicine, 2209 Garland Ave, Nashville, TN 37232, USA

2 Department of Medicine, Division of Hematology and Oncology, Vanderbilt University, School of Medicine, USA

3 Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, USA

Citation and License

BMC Genomics 2012, 13(Suppl 8):S21  doi:10.1186/1471-2164-13-S8-S21

Published: 17 December 2012

Abstract

Background

Many cancer clinical trials now specify the particular status of a genetic lesion in a patient's tumor in the inclusion or exclusion criteria for trial enrollment. To facilitate search and identification of gene-associated clinical trials by potential participants and clinicians, it is important to develop automated methods to identify genetic information from narrative trial documents.

Methods

We developed a two-stage classification method to identify genes and genetic lesion statuses in clinical trial documents extracted from the National Cancer Institute's (NCI's) Physician Data Query (PDQ) cancer clinical trial database. The first stage distinguishes gene entities from non-gene entities such as English words; the second determines whether a genetic lesion status is associated with an identified gene entity and, if so, which one. We developed and evaluated the method using a manually annotated data set containing 1,143 instances of the eight most frequently mentioned genes in cancer clinical trials. We also applied the classifier to a real-world cancer trial annotation task and evaluated its performance on a larger sample (4,013 instances of 249 distinct human gene symbols detected in 250 trials).
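
The two-stage idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the learner (a tiny naive Bayes model), the context-window features, the status labels ("mutated", "amplified", "wild_type", "no_status"), and the training sentences are all assumptions invented for illustration and are not taken from the PDQ data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes over bag-of-words features (illustrative only)."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        n_docs = sum(self.label_counts.values())

        def score(label):
            counts = self.word_counts[label]
            total = sum(counts.values()) + len(self.vocab)  # Laplace smoothing
            prior = math.log(self.label_counts[label] / n_docs)
            return prior + sum(math.log((counts[t] + 1) / total) for t in tokens)

        return max(self.label_counts, key=score)


def context(sentence, symbol, window=3):
    """Words within `window` positions of the candidate symbol (assumed feature set)."""
    tokens = sentence.lower().split()
    i = tokens.index(symbol.lower())
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]


# Hypothetical toy examples: (sentence, candidate symbol, stage-1 label, stage-2 label).
TRAIN = [
    ("patients with MET amplification are eligible", "MET", "gene", "amplified"),
    ("tumor must harbor a KIT mutation", "KIT", "gene", "mutated"),
    ("patients with KRAS wild type tumors", "KRAS", "gene", "wild_type"),
    ("EGFR testing is optional", "EGFR", "gene", "no_status"),
    ("patients who met all eligibility criteria", "met", "non_gene", None),
    ("a test kit will be provided", "kit", "non_gene", None),
]

# Stage 1 is trained on every candidate mention: gene entity vs. non-gene entity.
stage1 = NaiveBayes().fit(
    [context(s, sym) for s, sym, _, _ in TRAIN],
    [is_gene for _, _, is_gene, _ in TRAIN],
)

# Stage 2 is trained only on true gene mentions: which lesion status, if any.
gene_rows = [row for row in TRAIN if row[2] == "gene"]
stage2 = NaiveBayes().fit(
    [context(s, sym) for s, sym, _, _ in gene_rows],
    [status for _, _, _, status in gene_rows],
)


def classify(sentence, symbol):
    """Run stage 1, and run stage 2 only if the mention is judged to be a gene."""
    features = context(sentence, symbol)
    if stage1.predict(features) != "gene":
        return ("non_gene", None)
    return ("gene", stage2.predict(features))


print(classify("tumor must have a KIT mutation", "KIT"))
print(classify("patients who met the criteria", "met"))
```

The point of the split is that the second model never sees mentions the first model rejects, so it can concentrate on telling lesion statuses apart rather than also separating gene symbols from ordinary English words.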

Results

Evaluation on the manually annotated data set showed that the two-stage classifier outperformed the single-stage classifier, achieving its best average accuracy of 83.7% for the eight most frequently mentioned genes when optimized feature sets were used. The two-stage classifier also generalized better when trained on one set of genes and applied to a different, independent gene. When a gene-neutral, two-stage classifier was applied to the real-world task of cancer trial annotation, its highest accuracy was 89.8%, demonstrating the feasibility of developing a gene-neutral classifier for this task.

Conclusions

We presented a machine learning-based approach to detect gene entities and their genetic lesion statuses in clinical trial documents and demonstrated its use in cancer trial annotation. Such methods would be valuable for building information retrieval tools targeting gene-associated clinical trials.