Table 2

Feature classes remaining after recursive feature elimination (RFE). Examples of features and feature classes remaining after 64 iterations. In every round, we remove the 10% of the remaining features that have the lowest weight. After the 64 iterations, only 0.12% of all features remain. This table shows the highest-, middle-, and lowest-weighted features. High-weighted features are more likely to apply to positive samples (NEWGENE), low-weighted features to negative samples. Names in bold indicate binary orthographic features and the gazetteer (Gaz), in contrast to single features such as a particular 3-gram. The feature named special in Table 1 actually consists of four parts, two of which are present in the list of top-ranking features.

| Feature | Class | Weight | Feature | Class | Weight |
|---|---|---|---|---|---|
| **Gaz** | | 1.497386 | AACC | 4-gram | 0.088738 |
| insulin | Token | 0.632708 | D2-m | 4-gram | -0.022443 |
| protein | Token | 0.628168 | Stai | 4-gram | -0.082046 |
| kinase | Token | 0.608392 | mig | 3-gram | -0.083135 |
| human | Token | 0.536695 | Reve | 4-gram | -0.096548 |
| proteins | Token | 0.535368 | ing | 3-gram | -0.099499 |
| **greek** | | 0.498111 | GnT | Token | -0.099619 |
| **combi** | | 0.489201 | owl | 3-gram | -0.100996 |
| serum | Token | 0.480326 | 231 | Token | -0.104751 |
| **lowerUpper** | | 0.457806 | ZII | Token | -0.105133 |
| **singleCap** | | 0.438028 | had | Token | -0.106545 |
| factor | Token | 0.438028 | we | Token | -0.107104 |
| wild-type | Token | 0.389359 | [..] | | |
| **initCaps** | | 0.366269 | that | Token | -0.174203 |
| mutants | Token | 0.340689 | scre | 4-gram | -0.175351 |
| genes | Token | 0.340352 | OH | Token | -0.179445 |
| promoter | Token | 0.327395 | ims | 3-gram | -0.182513 |
| receptor | Token | 0.323412 | be | Token | -0.186265 |
| polymerase | Token | 0.305972 | . | Token | -0.188904 |
| complex | Token | 0.292019 | To | Token | -0.189576 |
| receptors | Token | 0.292019 | acyc | 4-gram | -0.191766 |
| c-myc | Token | 0.292019 | the | Token | -0.192838 |
| sites | Token | 0.243349 | off | Token | -0.197588 |
| mutant | Token | 0.243349 | rank | Token | -0.198915 |
| domain | Token | 0.231541 | Dar | Token | -0.205479 |
| sequences | Token | 0.216691 | ( | Token | -0.206405 |
| sequence | Token | 0.216683 | omit | 4-gram | -0.220064 |
| domain | Token | 0.215116 | nost | 4-gram | -0.223077 |
| **specialnumber** | | 0.205077 | spit | 4-gram | -0.238335 |
| isoforms | Token | 0.194679 | **allCaps** | | -0.243183 |
| **specialupperCase** | | 0.179926 | oped | 4-gram | -0.246457 |
| **capMixLetters** | | 0.179394 | The | Token | -0.246535 |
| [..] | | | aged | Token | -0.253814 |
| lare | 4-gram | 0.105354 | are | Token | -0.267228 |
| bicu | 4-gram | 0.103185 | ssif | 4-gram | -0.272211 |
| bea | 3-gram | 0.100539 | encoding | Token | -0.447471 |
| [ | Token | 0.097113 | which | Token | -0.535368 |
| ntei | 4-gram | 0.093310 | activate | Token | -0.535368 |
| GTTA | 4-gram | 0.088738 | contain | Token | -0.640844 |
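The pruning schedule described in the caption can be checked with a short sketch. It is only an illustration: the function names `remaining_fraction` and `rfe_prune` are hypothetical, ranking by absolute weight is an assumption (the caption says only "lowest weight"), and a real RFE run would retrain the classifier between rounds, which is omitted here.

```python
def remaining_fraction(rounds: int, drop_rate: float = 0.10) -> float:
    """Share of features left after `rounds` rounds that each drop `drop_rate`."""
    return (1.0 - drop_rate) ** rounds


def rfe_prune(weights: dict[str, float], drop_rate: float = 0.10) -> dict[str, float]:
    """One pruning round on a feature -> weight mapping.

    Assumption: features are ranked by absolute weight, so strongly
    negative features survive alongside strongly positive ones.
    """
    n_drop = max(1, int(len(weights) * drop_rate))
    ranked = sorted(weights, key=lambda name: abs(weights[name]))
    dropped = set(ranked[:n_drop])
    return {name: w for name, w in weights.items() if name not in dropped}


# 0.9 ** 64 is roughly 0.0012, i.e. about 0.12% of the original features.
fraction_left = remaining_fraction(64)

# A handful of weights taken from the table, pruned once.
sample = {
    "Gaz": 1.497386, "insulin": 0.632708, "protein": 0.628168,
    "kinase": 0.608392, "human": 0.536695, "D2-m": -0.022443,
    "which": -0.535368, "activate": -0.535368, "encoding": -0.447471,
    "contain": -0.640844,
}
pruned = rfe_prune(sample)
```

With ten sample features and a 10% drop rate, one round removes the single feature whose weight is closest to zero (here `D2-m`), while the strongly negative `contain` is kept.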


Hakenberg et al. BMC Bioinformatics 2005 6(Suppl 1):S9   doi:10.1186/1471-2105-6-S1-S9
