Table 3

Overlap between the pairs that are the most difficult and the easiest for the collection of kernels to classify correctly in the cross-validation (CV) and cross-learning (CL) settings
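| Difficulty | GT | Class/setting | AIMed | BioInfer | HPRD50 | IEPA | LLL | Total # | Total % |
|---|---|---|---|---|---|---|---|---|---|
| difficult | unknown | D CV | 537 | 1077 | 41 | 82 | 39 | 1776 | 10.4 |
| | | D CL | 628 | 1003 | 35 | 99 | 37 | 1802 | 10.6 |
| | | D = D_CV ∩ D_CL | 105 | 530 | 8 | 28 | 0 | 671 | 3.9 |
| | | p-value | 10^-10 | 10^-281 | 10^-2 | 10^-8 | 1.0 | | |
| | positive | PD CV | 162 | 281 | 20 | 32 | 17 | 512 | 12.2 |
| | | PD CL | 142 | 319 | 15 | 26 | 16 | 518 | 12.3 |
| | | PD = PD_CV ∩ PD_CL | 61 | 111 | 2 | 9 | 7 | 190 | 4.5 |
| | | p-value | 10^-60 | 10^-95 | 10^-1 | 10^-7 | 10^-6 | | |
| | negative | ND CV | 463 | 610 | 37 | 50 | 39 | 1199 | 9.3 |
| | | ND CL | 557 | 644 | 32 | 37 | 28 | 1298 | 10.1 |
| | | ND = ND_CV ∩ ND_CL | 184 | 295 | 12 | 19 | 11 | 521 | 4.0 |
| | | p-value | 10^-76 | 10^-204 | 10^-6 | 10^-15 | 10^-4 | | |
| easy | unknown | E CV | 2137 | 1870 | 85 | 83 | 36 | 4211 | 24.7 |
| | | E CL | 777 | 2563 | 45 | 95 | 73 | 3558 | 20.8 |
| | | E = E_CV ∩ E_CL | 464 | 1017 | 23 | 20 | 4 | 1528 | 8.9 |
| | | p-value | 10^-45 | 10^-184 | 10^-7 | 10^-3 | 1.0 | | |
| | positive | PE CV | 104 | 301 | 26 | 48 | 36 | 515 | 12.3 |
| | | PE CL | 115 | 364 | 29 | 27 | 22 | 557 | 13.3 |
| | | PE = PE_CV ∩ PE_CL | 49 | 147 | 6 | 10 | 7 | 219 | 5.2 |
| | | p-value | 10^-59 | 10^-136 | 10^-3 | 10^-7 | 10^-2 | | |
| | negative | NE CV | 2105 | 1752 | 59 | 94 | 23 | 4033 | 31.3 |
| | | NE CL | 593 | 2548 | 32 | 87 | 21 | 3281 | 25.5 |
| | | NE = NE_CV ∩ NE_CL | 440 | 1014 | 21 | 27 | 8 | 1510 | 11.7 |
| | | p-value | 10^-88 | 10^-215 | 10^-12 | 10^-7 | 10^-5 | | |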

We also indicate the size of each set, because the sizes vary with the size of the success-level classes. The abbreviations D, E, PD, ND, PE, and NE refer to the sets of difficult (unknown class label), easy (unknown class label), positive difficult, negative difficult, positive easy, and negative easy pairs, respectively; GT means ground truth. We highlight in bold the number of pairs in the intersection of the CV and CL settings. We show the p-value of Fisher’s independence χ2-test, rounded to the nearest power of 10. Bold typesetting indicates that the size of the overlap is too low.
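As an illustration of how a p-value of this kind could be obtained, the sketch below runs a Pearson χ2 independence test (1 degree of freedom, no continuity correction) on the 2×2 contingency table induced by two subsets of the same pair collection. The exact test variant behind the table is not specified beyond the note above, so this choice is an assumption. The function names are hypothetical and the numbers in the usage lines are made up; the per-corpus pair totals needed to recompute the table are not listed here.

```python
import math


def overlap_independence_test(n_total, n_a, n_b, n_both):
    """Pearson chi-square test of independence for two subsets of one collection.

    n_total: number of pairs in the corpus (not listed in the table; must be supplied)
    n_a, n_b: sizes of the two sets, e.g. a CV set and a CL set
    n_both: size of their intersection
    Returns (chi2, p_value). Assumes 1 degree of freedom and no continuity correction.
    """
    a = n_both                           # in both sets
    b = n_a - n_both                     # only in the first set
    c = n_b - n_both                     # only in the second set
    d = n_total - n_a - n_b + n_both     # in neither set
    if min(a, b, c, d) < 0:
        raise ValueError("inconsistent set sizes")
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A degenerate margin carries no information about dependence.
        return 0.0, 1.0
    chi2 = n_total * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def nearest_power_of_ten_exponent(p_value):
    """Exponent k such that 10**k is the power of 10 nearest to p_value (log scale)."""
    if p_value <= 0.0:
        return None  # p underflowed to zero; no finite exponent to report
    return round(math.log10(p_value))


# Hypothetical numbers: 1000 pairs, two sets of 100 pairs each.
# An overlap of 10 is exactly what independence predicts (100 * 100 / 1000).
chi2, p = overlap_independence_test(1000, 100, 100, 10)
print(chi2, p, nearest_power_of_ten_exponent(p))

# A much larger overlap of 50 gives a very small p-value.
chi2, p = overlap_independence_test(1000, 100, 100, 50)
print(chi2, nearest_power_of_ten_exponent(p))
```

The χ2 statistic is two-sided, so a small p-value alone does not say whether the overlap is larger or smaller than expected. Comparing `n_both` with the expected overlap `n_a * n_b / n_total` gives the direction.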

Tikk et al. BMC Bioinformatics 2013 14:12   doi:10.1186/1471-2105-14-12
