Open Access Research article

Next-generation text-mining mediated generation of chemical response-specific gene sets for interpretation of gene expression data

Kristina M Hettne1,2,3*, André Boorsma4, Dorien A M van Dartel5,6, Jelle J Goeman7, Esther de Jong5,8, Aldert H Piersma5,8, Rob H Stierum4, Jos C Kleinjans1 and Jan A Kors2

Author Affiliations

1 Department of Toxicogenomics, Maastricht University, Maastricht, The Netherlands

2 Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands

3 Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands

4 Microbiology and Systems Biology, TNO, Zeist, The Netherlands

5 Laboratory for Health Protection Research, National Institute for Public Health and the Environment (RIVM), Bilthoven, The Netherlands

6 Human and Animal Physiology, Wageningen University, Wageningen, The Netherlands

7 Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands

8 Institute for Risk Assessment Sciences, Utrecht University, Utrecht, The Netherlands

BMC Medical Genomics 2013, 6:2  doi:10.1186/1755-8794-6-2

Published: 29 January 2013



Background

Few chemical response-specific lists of genes (gene sets) are available for predicting the pharmacological and/or toxic effects of compounds. We hypothesize that more gene sets can be created by next-generation text mining (next-gen TM), and that these can be used with gene set analysis (GSA) methods for chemical treatment identification, for pharmacological mechanism elucidation, and for comparing compound toxicity profiles.


Methods

We created 30,211 chemical response-specific gene sets for human and mouse by next-gen TM, and derived 1,189 (human) and 588 (mouse) gene sets from the Comparative Toxicogenomics Database (CTD). We tested for significant differential expression (SDE; false discovery rate-corrected p-values < 0.05) of the next-gen TM-derived and CTD-derived gene sets in gene expression (GE) data sets of five chemicals (from experimental models). We tested for SDE of gene sets for six fibrates in a peroxisome proliferator-activated receptor alpha (PPARA) knock-out GE data set and compared the results with those from the Connectivity Map (cMap). We tested for SDE of 319 next-gen TM-derived gene sets for environmental toxicants in three GE data sets of triazoles, and tested for SDE of 442 gene sets associated with embryonic structures. We compared the gene sets with triazole effects seen in the Whole Embryo Culture (WEC), and used principal component analysis (PCA) to discriminate triazoles from other chemicals.
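The significance criterion above (false discovery rate-corrected p-values < 0.05) can be illustrated with a minimal sketch. The abstract does not say which correction procedure was used, so the Benjamini-Hochberg step-up procedure shown here is an assumption, and the gene set names and raw p-values are hypothetical values chosen only for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: the source only says "false discovery rate-corrected";
    Benjamini-Hochberg is one common choice, not necessarily the authors'.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_gene_sets(p_by_set, alpha=0.05):
    """Keep gene sets whose adjusted p-value is below alpha."""
    names = list(p_by_set)
    adjusted = benjamini_hochberg([p_by_set[name] for name in names])
    return {name: q for name, q in zip(names, adjusted) if q < alpha}


# Hypothetical raw p-values for four gene sets (not data from the study).
raw_p = {"set_a": 0.01, "set_b": 0.04, "set_c": 0.03, "set_d": 0.20}
hits = significant_gene_sets(raw_p)
print(hits)
```

In this toy input only one gene set stays below the 0.05 threshold after correction, even though three raw p-values are below 0.05, which is why the correction matters when thousands of gene sets are tested at once.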


Results

Next-gen TM-derived gene sets matching the chemical treatment were significantly altered in three GE data sets, and the corresponding CTD-derived gene sets were significantly altered in five GE data sets. Six next-gen TM-derived and four CTD-derived fibrate gene sets were significantly altered in the PPARA knock-out GE data set. None of the fibrate signatures in cMap scored significantly against the PPARA GE signature. Thirty-three environmental toxicant gene sets were significantly altered in the triazole GE data sets. Twenty-one of these toxicants had a toxicity pattern similar to that of the triazoles. We confirmed embryotoxic effects, and discriminated triazoles from other chemicals.


Conclusions

Gene set analysis with next-gen TM-derived chemical response-specific gene sets is a scalable method for identifying similarities in gene responses to other chemicals, from which one may infer potential mode of action and/or toxic effect.

Keywords

Text mining; Toxicogenomics; Gene set analysis