Table 1

Feature classes and their impact on prediction quality. Table of all feature classes. *: classes used in the BioCreAtIvE submission, ◦: classes implemented afterwards, partly adopted from other participants of the contest. The fourth column gives the impact of each single feature class compared to the baseline (only tokens). These figures include post-processing. The fifth column shows how precision and recall are affected. Letter surface clues (last rows) refer to the following features: {special, allCaps, initCap, capMix, lowMix, Idl, ddd}.

| Feature | Example | Short name | Impact | Effect on precision (P) and recall (R) |
|---|---|---|---|---|
| Token* | Sro7 | Token | = 54% | - baseline - |
| Unseen token* | | UToken | | |
| n-grams of token* | | 1G, 2G, .. | +15%; +14% | 1..4-grams, P+, R++; 1..3-grams |
| Previous & next tokens | | P/NToken | -5%; -6% | [1,1]-window, P+, R-; [2,2]-window |
| n-grams of tokens in window | | 2PG/2NG/.. | | |
| Prefixes, suffixes | | 1P, 2P, 3P, 1S.. | ±0 | |
| Stop word | the, or | Stop | -5%; -1%; -.5% | 10,000 words, P+, R-; 1000 words, P+, R-; 100 words, P+, R- |
| POS tag | NN, DT | POS | -50% | P-, R- |
| Initial upper case* | Msp | initCap | +.5% | P=, R+ |
| All chars are upper case* | MMTV | allCaps | +.5% | P-, R+ |
| Upper case letters* | InlC, GUS | Upper | | |
| Upper case (skip first)* | MsPRP2 | Upper2 | | |
| Single capital | A | singleCap | +.5% | P+, R+ |
| Two capitals | RalGDS | twoCaps | +.5% | P+, R+ |
| Capital, then mixed letters ◦ | IgM | capMix | | |
| Lower case, then mixed ◦ | kDa | lowMix | +1% | P-, R+ |
| Special symbols* | ICAM-1 | special | ±0 | P-, R+ |
| Characters and numbers* | p50 | CharNum | | |
| Numbers* | p50, HSF1 | Number | | |
| Letters, digits, letters ◦ | H2kd | Idl | ±0 | |
| Digit, dot, digit ◦ | 5.78 | ddd | -.1% | P-, R- |
| Greek letter ◦ | alpha | greek | +.5% | P+, R- |
| Roman numeral ◦ | II, xii | roman | ±0 | R+, R- |
| Number followed by '%' ◦ | 75.0% | percentage | -.1% | P-, R- |
| DNA, RNA sequences ◦ | ACCGT | DNA, RNA | -.1% | P-, R- |
| Longest consonant chain* | Sro7 → 2 | LCC | -2% | P-, R- |
| Keyword distance* | | keyDist | -20% | P+, R- |
| Gazetteer* | | Gaz | -3% | P-, R- |
| Prev./next token is NEWGENE | | PTG, NTG | -18% | prev. only, P+, R- |
| Tokens + letter surface clues | | | +2% | P+, R- |
| Tokens + 1,2,3-grams + greek + roman + letter surface clues | | | +14% | P+, R++ |
| Tokens + 1,2,3-grams + keyDist + Gaz + LCC + special + combi + allCaps + initCap* | | | +16% | P+, R++ |
| Tokens + 1,2,3,4-grams + keyDist + Gaz + LCC + special + combi + allCaps + initCap* + lowMix ◦ | | | +18% | P+, R++ |

Hakenberg et al. BMC Bioinformatics 2005 6(Suppl 1):S9 doi:10.1186/1471-2105-6-S1-S9
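
To make the surface-clue rows of Table 1 more concrete, the following is a minimal Python sketch of how a few of these token-level features might be computed. The table gives only a short name and one example per feature, so every regular expression here, the `GREEK` word list, the choice to treat `y` as a consonant, and the helper names `surface_features` and `longest_consonant_chain` are assumptions made for illustration, not the authors' definitions. Features whose definition cannot be inferred from the example (such as `twoCaps`) are left out.

```python
import re

# Hypothetical, deliberately incomplete list of Greek letter names.
GREEK = {"alpha", "beta", "gamma", "delta", "epsilon", "kappa", "lambda", "sigma"}

# Short name -> regex that must match the whole token.
# All patterns are guesses derived from the examples in Table 1.
PATTERNS = {
    "initCap": r"[A-Z].*",                    # Msp
    "allCaps": r"[A-Z]+",                     # MMTV
    "singleCap": r"[A-Z]",                    # A
    "capMix": r"[A-Z][a-z]+[A-Z][A-Za-z]*",   # IgM
    "lowMix": r"[a-z]+[A-Z][A-Za-z]*",        # kDa
    "special": r".*[^A-Za-z0-9].*",           # ICAM-1
    "Idl": r"[A-Za-z]+[0-9]+[A-Za-z]+",       # H2kd
    "ddd": r"[0-9]+\.[0-9]+",                 # 5.78
    "roman": r"[IVXLCDM]+|[ivxlcdm]+",        # II, xii
    "percentage": r"[0-9]+(\.[0-9]+)?%",      # 75.0%
    "DNA, RNA": r"[ACGTU]+",                  # ACCGT
}


def longest_consonant_chain(token):
    """Length of the longest run of consecutive consonants (LCC)."""
    # Assumption: every ASCII letter except a, e, i, o, u counts as a consonant.
    runs = re.findall(r"[b-df-hj-np-tv-z]+", token.lower())
    return max((len(run) for run in runs), default=0)


def surface_features(token):
    """Return the set of short names whose pattern matches the whole token."""
    feats = {name for name, pat in PATTERNS.items() if re.fullmatch(pat, token)}
    if token.lower() in GREEK:
        feats.add("greek")
    return feats


if __name__ == "__main__":
    for tok in ["Sro7", "MMTV", "IgM", "kDa", "ICAM-1", "5.78", "alpha"]:
        print(tok, sorted(surface_features(tok)), longest_consonant_chain(tok))
```

Under these assumptions a token can fire several features at once (for example, `MMTV` is both `initCap` and `allCaps`), which fits the table's practice of reporting the impact of each feature class separately and in combination.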