Table 1

Feature classes and their impact on prediction quality. Table of all feature classes. *: classes used in the BioCreAtIvE submission; ◦: classes implemented afterwards, partly adopted from other participants of the contest. The fourth column gives the impact of each single feature class compared to the baseline (tokens only). These figures include post-processing. The fifth column shows how precision and recall are affected. Letter surface clues (last rows) refer to the following features: {special, allCaps, initCap, capMix, lowMix, Idl, ddd}.

Feature | Example | Short name | Impact | Notes (P: precision, R: recall)
Token* | Sro7 | Token | = 54% | - baseline -
Unseen token* | | UToken | |
n-grams of token* | | 1G, 2G, .. | +15%; +14% | 1..4-grams, P+, R++; 1..3-grams
Previous & next tokens | | P/NToken | -5%; -6% | [1,1]-window, P+, R-; [2,2]-window
n-grams of tokens in window | | 2PG/2NG/.. | |
Prefixes, suffixes | | 1P, 2P, 3P, 1S.. | ±0 |
Stop word | the, or | Stop | -5%; -1%; -.5% | 10,000 words, P+, R-; 1000 words, P+, R-; 100 words, P+, R-
POS tag | NN, DT | POS | -50% | P-, R-
Initial upper case* | Msp | initCap | +.5% | P=, R+
All chars are upper case* | MMTV | allCaps | +.5% | P-, R+
Upper case letters* | InlC, GUS | Upper | |
Upper case (skip first)* | MsPRP2 | Upper2 | |
Single capital | A | singleCap | +.5% | P+, R+
Two capitals | RalGDS | twoCaps | +.5% | P+, R+
Capital, then mixed letters ◦ | IgM | capMix | |
Lower case, then mixed ◦ | kDa | lowMix | +1% | P-, R+
Special symbols* | ICAM-1 | special | ±0 | P-, R+
Characters and numbers* | p50 | CharNum | |
Numbers* | p50, HSF1 | Number | |
Letters, digits, letters ◦ | H2kd | Idl | ±0 |
Digit, dot, digit ◦ | 5.78 | ddd | -.1% | P-, R-
Greek letter ◦ | alpha | greek | +.5% | P+, R-
Roman numeral ◦ | II, xii | roman | ±0 | R+, R-
Number followed by '%' ◦ | 75.0% | percentage | -.1% | P-, R-
DNA, RNA sequences ◦ | ACCGT | DNA, RNA | -.1% | P-, R-
Longest consonant chain* | Sro7 → 2 | LCC | -2% | P-, R-
Keyword distance* | | keyDist | -20% | P+, R-
Gazetteer* | | Gaz | -3% | P-, R-
Prev./next token is NEWGENE | | PTG, NTG | -18% | prev. only, P+, R-
Tokens + letter surface clues | | | +2% | P+, R-
Tokens + 1,2,3-grams + greek + roman + letter surface clues | | | +14% | P+, R++
Tokens + 1,2,3-grams + keyDist + Gaz + LCC + special + combi + allCaps + initCap* | | | +16% | P+, R++
Tokens + 1,2,3,4-grams + keyDist + Gaz + LCC + special + combi + allCaps + initCap* + lowMix ◦ | | | +18% | P+, R++
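
The letter surface clues and the longest consonant chain (LCC) listed in the table can be illustrated with a minimal sketch. The regular expressions and the helper names `surface_clues` and `longest_consonant_chain` are assumptions made for illustration; the table gives only one example per feature, so the authors' exact definitions may differ (for instance, whether "y" counts as a consonant).

```python
import re

# Hypothetical patterns, chosen so that each example token from the table
# (Msp, MMTV, IgM, kDa, ICAM-1, H2kd, 5.78) triggers the matching feature.
# Every pattern must match the whole token.
SURFACE_PATTERNS = {
    "special": re.compile(r".*[^A-Za-z0-9].*"),       # any non-alphanumeric symbol
    "allCaps": re.compile(r"[A-Z]{2,}"),              # only upper-case letters
    "initCap": re.compile(r"[A-Z][a-z]+"),            # one capital, then lower case
    # capital first, remaining letters contain both lower and upper case
    "capMix": re.compile(r"[A-Z](?=[A-Za-z]*[a-z])(?=[A-Za-z]*[A-Z])[A-Za-z]+"),
    # lower case first, remaining letters contain at least one capital
    "lowMix": re.compile(r"[a-z](?=[A-Za-z]*[A-Z])[A-Za-z]+"),
    "Idl": re.compile(r"[A-Za-z]+[0-9]+[A-Za-z]+"),   # letters, digits, letters
    "ddd": re.compile(r"[0-9]+\.[0-9]+"),             # digit, dot, digit
}

VOWELS = set("aeiouAEIOU")


def surface_clues(token: str) -> list[str]:
    """Return the names of all surface-clue features that fire for a token."""
    return [name for name, pattern in SURFACE_PATTERNS.items()
            if pattern.fullmatch(token)]


def longest_consonant_chain(token: str) -> int:
    """Length of the longest run of consecutive consonant letters.

    Assumption: every letter other than a, e, i, o, u is a consonant;
    digits and symbols end the current run.
    """
    best = run = 0
    for ch in token:
        if ch.isalpha() and ch not in VOWELS:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


if __name__ == "__main__":
    for tok in ["Msp", "MMTV", "IgM", "kDa", "ICAM-1", "H2kd", "5.78", "Sro7"]:
        print(tok, surface_clues(tok), longest_consonant_chain(tok))
```

Under these assumed definitions a token can trigger several features at once (for example, `5.78` fires both `special` and `ddd`), which matches the table's treatment of each feature as an independent binary clue.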


Hakenberg et al. BMC Bioinformatics 2005 6(Suppl 1):S9   doi:10.1186/1471-2105-6-S1-S9
