Table 1

| Dataset description | Initial dataset | Dataset with PubMed abstracts | Dataset fulfilling the algorithm’s requirements* | Final dataset (ambiguous aliases excluded) |
| --- | --- | --- | --- | --- |
| EntrezGene official symbols | 100 | 73 | 68** | 68 |
| Aliases | 425 | 256 | 223 | 165 |
| Abstracts in text corpus | - | 13355 | 12088 | 9005 |
| Unique PubMed IDs in text corpus | - | 11022 | 10312 | 7523 |
| Redundancy in text corpus (%) | - | 21 | 16.6 | 19.7 |


* The algorithm requires the official gene symbol and at least one alias and one internal control to produce text corpora of PubMed abstracts. It also requires an informative group-specific vocabulary that passes the filters for ubiquitous terms.

** Five official gene symbols (DERL3, KCNA7, KCNJ14, MED18, and TBRV4-2) did not fulfil the algorithm’s requirements because their aliases retrieved no PubMed abstracts.

Coimbra et al. BMC Genomics 2010 11(Suppl 5):S3   doi:10.1186/1471-2164-11-S5-S3
