Table 16

The ten most important features related to difficult (D) and easy (E) classes measured by information gain
      Difficult (D)                             Easy (E)
Rank  Feature name                ±   IG        Feature name              ±   IG
1     sentence length (char)          0.0089    label entropy in ST       +   0.110
2     label entropy in ST (SP)        0.0086    sentence length (char)    +   0.090
3     dep frequency in DG             0.0079    label entropy in DG       +   0.089
4     # of proteins in sentence       0.0078    nn frequency in DG            0.081
5     sentence length (word)          0.0069    appos frequency in DG         0.079
6     conj_and frequency in DG        0.0069    conj_and frequency in DG      0.076
7     prep_with frequency in DG       0.0066    dep frequency in DG           0.073
8     prep_with occurrence in DG      0.0066    det frequency in DG           0.069
9     nsubjpass frequency in DG       0.0059    amod frequency in DG          0.063
10    prep_in frequency in DG         0.0057    dobj frequency in DG          0.062

IG – information gain; ST – syntax tree; DG – dependency graph; SP – shortest path. Italic typesetting indicates parsing tree labels. The sign after each feature indicates positive/negative correlation.
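
For readers unfamiliar with the measure, the following is a minimal sketch of how information gain can be computed for one feature against a class label, assuming the feature has already been discretized. The function names and the toy data are hypothetical and are not taken from the study; the sketch only illustrates the standard definition IG = H(class) − H(class | feature).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """IG = H(class) - H(class | feature) for a discrete feature."""
    total = len(labels)
    # Group the class labels by the feature value they co-occur with.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Conditional entropy: entropy of each group, weighted by group size.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: a binarized "long sentence" feature and a
# difficult ("D") versus "other" class label for eight candidate pairs.
is_long = [True, True, True, False, False, False, False, False]
classes = ["D", "D", "other", "other", "other", "other", "other", "D"]
ig = information_gain(is_long, classes)
print(f"IG = {ig:.4f} bits")
```

Continuous features such as sentence length would need to be binned before a computation like this; the binning scheme is a free choice in this sketch and is not specified by the table.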

Tikk et al. BMC Bioinformatics 2013 14:12   doi:10.1186/1471-2105-14-12