Vocabulary fingerprint for FADS1 and its aliases. Schematic description of a group-specific informative vocabulary automatically extracted from a text corpus of PubMed abstracts. In this example, two “synonyms” (green arrows) and one “ambiguous” alias (red arrow) of the official gene symbol FADS1 (which encodes the enzyme fatty acid desaturase 1; blue arrow) are distinguished by the algorithm when the baseline cut-off is set at c = 0.05. The internal control is the unrelated official gene symbol CLEC2B (black arrow). The Jaccard distances to FADS1 are: 1) D5D = 0.937; 2) fatty acid desaturase 1 = 0.944; 3) TU12 = 1; 4) CLEC2B = 1. Yellow boxes = words from the group-specific informative vocabulary that occur in the text corpus of a given gene symbol or alias.
Coimbra et al. BMC Genomics 2010 11(Suppl 5):S3 doi:10.1186/1471-2164-11-S5-S3
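The Jaccard distances quoted in the caption can be illustrated with a minimal sketch, assuming the standard set-based definition (one minus the ratio of shared words to all words in either vocabulary). The function name `jaccard_distance` and the toy word sets are hypothetical and are not taken from the paper; how the authors apply the cut-off c before comparing vocabularies is not shown here.

```python
def jaccard_distance(vocab_a, vocab_b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two collections of words."""
    a, b = set(vocab_a), set(vocab_b)
    union = a | b
    if not union:
        # Two empty vocabularies: treated as identical by convention (assumption).
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Toy vocabularies, invented purely for illustration.
symbol_words = {"desaturase", "fatty", "acid", "lipid"}
alias_words = {"desaturase", "lipid", "plasma"}
unrelated_words = {"lectin", "receptor"}

print(jaccard_distance(symbol_words, alias_words))      # 2 shared of 5 total words
print(jaccard_distance(symbol_words, unrelated_words))  # no shared words
```

Under this definition a distance of 1, as reported for TU12 and CLEC2B, means the two vocabularies share no words, while values below 1 indicate some overlap.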