A detailed error analysis of 13 kernel methods for protein-protein interaction extraction

Tikk, Domonkos; Solt, Illés; Thomas, Philippe; Leser, Ulf

doi:10.1186/1471-2105-14-12

Research article
Open access
Published: 16 January 2013

BMC Bioinformatics volume 14, Article number: 12 (2013)

Abstract

Background

Kernel-based classification is the current state-of-the-art for extracting pairs of interacting proteins (PPIs) from free text. Various proposals have been put forward, which diverge especially in the specific kernel function, the type of input representation, and the feature sets. These proposals are regularly compared to each other regarding their overall performance on different gold standard corpora, but little is known about their respective performance on the instance level.

Results

We report on a detailed analysis of the shared characteristics and the differences between 13 current methods using five PPI corpora. We identified a large number of difficult (misclassified by most methods) and easy (correctly classified by most methods) PPIs. We show that kernels using the same input representation perform similarly on these pairs and that building ensembles from dissimilar kernels leads to a significant performance gain. However, our analysis also reveals that difficult pairs share few characteristics, which lowers the hope that new methods, if built along the same lines as current ones, will deliver breakthroughs in extraction performance.

Conclusions

Our experiments show that current methods do not seem to do very well in capturing the shared characteristics of positive PPI pairs, which must also be attributed to the heterogeneity of the (still very few) available corpora. Our analysis suggests that performance improvements should be sought in novel feature sets rather than in novel kernel functions.

Background

Automatically extracting protein-protein interactions (PPIs) from free text is one of the major challenges in biomedical text mining [1-6]. Several methods, which usually are co-occurrence-based, pattern-based, or machine-learning-based [7], have been developed and compared using a slowly growing body of gold standard corpora [8]. However, progress has always been slow (if measured in terms of precision and recall values achieved on the different corpora) and seems to have slowed down even further over recent years; furthermore, current results still do not match the performance that has been achieved in other areas of relationship extraction [9].

In this paper, we want to elucidate the reason for the slow progress by performing a detailed, cross-method study of characteristics shared by PPI instances which many methods fail to classify correctly. We concentrate on a fairly recent class of PPI extraction algorithms, namely kernel methods [10, 11]. The reason for this choice is that these methods were the top performers in recent competitions [12, 13]. In a nutshell, they work as follows. First, they require a training corpus consisting of labeled sentences, some of which contain PPIs and/or non-interacting proteins, while others contain only one or no protein mentions. All sentences in the training corpus are transformed into structured representations that aim to best capture properties of how the interaction is expressed (or not expressed, for negative examples). The representations of protein pairs together with their gold standard PPI labels are analyzed by a kernel-based learner (mostly an SVM), which builds a predictive model. When analyzing a new sentence for PPIs, its candidate protein pairs are turned into the same representation and then classified by the kernel method. For the sake of brevity, we often use the term kernel to refer to the combination of an SVM learner and a kernel method.

Central to the learning and the classification phases is a so-called kernel function. Simply speaking, a kernel function is a function that takes the representations of two instances (here, protein pairs) and computes their similarity. Kernel functions differ in (1) the underlying sentence representation (bag-of-words, token sequence with shallow linguistic features, syntax tree parse, dependency graphs); (2) the substructures retrieved from the sentence representation to define interactions; and (3) the calculation of the similarity function.
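
To make the notion concrete, the following is a minimal sketch of a kernel function over the simplest representation mentioned above, a bag of words. It is not one of the 13 kernels studied here; the function name `bow_cosine_kernel`, the `ENTITY1`/`ENTITY2` placeholders and the toy token lists are assumptions made for illustration.

```python
import math
from collections import Counter


def bow_cosine_kernel(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token lists."""
    bag_a, bag_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the tokens of the first bag; missing tokens count as 0.
    dot = sum(count * bag_b[tok] for tok, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two hypothetical candidate pairs, each represented by its sentence tokens.
pair_1 = ["ENTITY1", "binds", "to", "ENTITY2"]
pair_2 = ["ENTITY1", "interacts", "with", "ENTITY2"]
similarity = bow_cosine_kernel(pair_1, pair_2)
```

The two toy instances share only the entity placeholders, so the similarity is moderate; richer representations change what counts as a shared substructure, not the overall shape of the function.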

In our recent study [14], we analyzed nine kernel-based methods in a comprehensive benchmark and concluded that dependency graph and shallow linguistic feature representations are superior to syntax tree ones. Although we identified three kernels that outperformed the others (APG, SL, kBSPS; see details below), the study also revealed that none of them seems to be a single best approach due to the sensitivity of the methods to various factors—such as parameter settings, evaluation scenario and corpora. This leads to highly heterogeneous evaluation results indicating that methods are strongly prone to over-fit the training corpus.

The focus of this paper is a cross-kernel error analysis at the instance level, with the goal of exploring possible ways to improve kernel-based PPI extraction. To this end, we determine difficulty classes of protein pairs and investigate the similarity of kernels in terms of their predictions. We show that kernels using the same input representation perform similarly on these pairs and that building ensembles from dissimilar kernels leads to a significant performance gain. Additionally, we identify kernels that perform better on certain difficulty classes, paving the way to more complex ensembles. We also show that a generic feature set combined with linear classifiers achieves performance on par with most kernels. However, our main conclusion is pessimistic: our results indicate that significant progress in the field of PPI extraction can probably only be achieved if future methods leave the beaten track.

Methods

We recently performed a comprehensive benchmark of nine kernel-based approaches (hereinafter we refer to them briefly as kernels) [14]. In the meantime, we obtained another four kernels: three of them were originally proposed by Kim et al. [15] and one is a modification described in [16]; we refer to them collectively as Kim’s kernels. In this work, we investigate similarities and differences between these 13 kernels.

Kernels

The shallow linguistic (SL) [17] kernel does not use deep parsing information. It is based solely on bag-of-words features (words occurring in the sentence fore-between, between and between-after relative to the pair of investigated proteins), surface features (capitalization, punctuation, numerals), and shallow linguistic features (POS-tag, lemma) generated from tokens to the left and right of the two proteins (in general: entities) of the protein pair.

Subtree (ST; [18]), subset tree (SST; [19]), partial tree (PT; [20]) and spectrum tree (SpT; [21]) kernels exploit the syntax tree representation of sentences. They differ in the definition of the extracted substructures. ST, SST and PT kernels extract subtrees of the syntax parse tree that contain the analyzed protein pair. SpT uses vertex-walks, that is, sequences of edge-connected syntax tree nodes, as the unit of representation. When comparing two protein pairs, the number of identical substructures is calculated as the similarity score.

The next group of kernels uses the dependency parse representation of sentences. Edit distance and cosine similarity kernels (edit, cosine; [22]), as well as the k-band shortest path spectrum (kBSPS; [14]), primarily use the shortest path between the entities, but the latter optionally allows for the k-band extension of this path in the representation. The most sophisticated kernel, all-path graph (APG; [23]), builds on both the dependency graph and the token sequence representations of the entire sentence, and weighs connections within and outside the shortest path differently.
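
The shortest path between the two entities, on which several of these kernels rely, can be extracted with a breadth-first search over the dependency graph. The sketch below is a minimal illustration that treats the graph as undirected and ignores edge labels; the function name `shortest_path`, the toy sentence and its edges are assumptions, not the output of any particular parser.

```python
from collections import deque


def shortest_path(edges, source, target):
    """Breadth-first search over an undirected view of a dependency graph."""
    neighbours = {}
    for head, dependent in edges:
        neighbours.setdefault(head, []).append(dependent)
        neighbours.setdefault(dependent, []).append(head)
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in neighbours.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # the two tokens are not connected


# Hypothetical (head, dependent) edges for
# "ENTITY1 strongly activates ENTITY2 in yeast".
edges = [
    ("activates", "ENTITY1"),
    ("activates", "strongly"),
    ("activates", "ENTITY2"),
    ("activates", "in"),
    ("in", "yeast"),
]
path = shortest_path(edges, "ENTITY1", "ENTITY2")
```

Tokens off the path ("strongly", "in", "yeast") are dropped, which is why shortest-path representations are compact but can miss context that a k-band extension or a whole-sentence graph retains.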

Kim’s kernels [15] also use the shortest path of the dependency parses. The four kernels differ in the information they use from the parses. The lexical kernel uses only the lexical information encoded in the dependency tree, that is, nodes are the lemmas of the sentence, connected by edges labeled with dependency relations. The shallow kernel retains only the POS-tag information in the nodes. Both kernels calculate the similarity score as the number of identical subgraphs of two shortest paths with the specific node labeling. The combined kernel is the sum of the former two variants. The syntactic kernel, defined in [16], uses exclusively the structural information from the dependency tree, that is, only the edge labels are considered when calculating the similarity score.

Since Fayruzov’s implementation of Kim’s kernels does not automatically determine the threshold that separates the positive and negative classes, it has to be specified for each model separately. Therefore, in addition to the parameter search described in [14] and re-used here, we also performed a coarse-grid threshold search in [0,1] with step 0.05. Assuming that the test corpus has similar characteristics to the training one (the usual guess in the absence of further knowledge), we selected the threshold between positive and negative classes such that their ratio best approximated the ratio measured on the training set. Note that APG [23] applies a similar threshold searching strategy but optimizes the threshold against the F-score on the training set.
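
A threshold search of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: it assumes classifier scores in [0,1], expresses the class ratio as the fraction of positive pairs, and breaks ties in favour of the lowest threshold. The function name `select_threshold` and the toy scores are hypothetical.

```python
def select_threshold(scores, train_positive_fraction, step=0.05):
    """Pick the grid threshold whose predicted positive fraction is closest
    to the positive fraction observed on the training set."""
    n_steps = round(1 / step)
    # Rounding removes floating-point noise such as 0.6000000000000001.
    grid = [round(i * step, 10) for i in range(n_steps + 1)]
    best_threshold, best_gap = None, float("inf")
    for threshold in grid:
        predicted_fraction = sum(s >= threshold for s in scores) / len(scores)
        gap = abs(predicted_fraction - train_positive_fraction)
        if gap < best_gap:  # strict comparison keeps the first (lowest) best
            best_threshold, best_gap = threshold, gap
    return best_threshold


# Hypothetical scores for ten test pairs; 30% of training pairs are positive.
scores = [0.05, 0.12, 0.31, 0.33, 0.48, 0.52, 0.58, 0.71, 0.86, 0.93]
threshold = select_threshold(scores, train_positive_fraction=0.3)
```

Replacing the gap criterion with the F-score on the training set would give the variant attributed to APG above.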

Classifiers and parameters

Typically, kernel functions are integrated into SVM implementations. Several freely available and extensible implementations of SVMs exist, among which SVM^light [24] and LibSVM [25] are probably the most renowned ones. Both can be adapted by supplying a user-defined kernel function. In SVM^light, kernel functions can be defined as a real function of a pair in the corresponding instance representation. LibSVM, on the other hand, requires the user to pre-compute kernel values, i.e., to pass to the SVM learner a matrix containing the pairwise similarity of all instances. Accordingly, most of the kernels we experimented with use the SVM^light implementation, except for the SL and Kim’s kernels, which use LibSVM, and APG, which internally uses a sparse regularized least squares (RLS) SVM.
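
The difference between the two integration styles can be illustrated with a short sketch: a kernel written as a function of two instances, and the matrix of pairwise values that a learner expecting pre-computed kernels would receive. The names `gram_matrix` and `shared_token_kernel` and the toy instances are assumptions; the sketch does not reproduce the input format of any specific SVM package.

```python
def gram_matrix(instances, kernel):
    """Pairwise kernel values for all instances, as a symmetric list of lists."""
    n = len(instances)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = kernel(instances[i], instances[j])
            matrix[i][j] = value
            matrix[j][i] = value  # a kernel is symmetric, so compute once
    return matrix


def shared_token_kernel(tokens_a, tokens_b):
    """Toy kernel: number of distinct tokens the two instances share."""
    return float(len(set(tokens_a) & set(tokens_b)))


# Three hypothetical candidate pairs with blinded entity names.
instances = [
    ["ENTITY1", "binds", "ENTITY2"],
    ["ENTITY1", "inhibits", "ENTITY2"],
    ["ENTITY1", "and", "ENTITY2", "were", "measured"],
]
K = gram_matrix(instances, shared_token_kernel)
```

Pre-computation costs memory quadratic in the number of instances, whereas a callable kernel is evaluated on demand by the learner.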

Corpora

We use the five freely available and widely used PPI-annotated resources also described in [8], i.e., AIMed [26], BioInfer [27], HPRD50 [28], IEPA [29], and LLL [30].

Evaluation method

We report the standard evaluation measures: precision (P), recall (R), and F₁-score (F). As we have shown in our previous study [14], the AUC measure (area under the receiver operating characteristic curve), which is often used in recent literature to characterize classifiers and is independent of the distribution of positive and negative classes, depends very much on the learning algorithm of the classifier, and only partially on the kernel. Therefore, in this study we stick to the above three measures, which actually give a better picture of the expected classification performance on new texts. Results are reported in two different evaluation settings. Primarily, we use the document-level cross-validation scheme (CV), which still seems to be the de facto standard in PPI extraction. We also use the cross-learning (CL) evaluation strategy for identifying pairs that behave similarly across various evaluation methods.
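
For reference, the three measures can be computed from gold labels and predictions as in the following minimal sketch. The function name `precision_recall_f1` and the toy labels are assumptions; undefined ratios are reported as 0.0 by convention here.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall and F1 for the positive (interacting) class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold labels and predictions for eight protein pairs.
gold = [True, True, True, False, False, False, False, True]
predicted = [True, False, True, True, False, False, False, True]
p, r, f = precision_recall_f1(gold, predicted)
```

Unlike AUC, all three values depend on the decision threshold and on the proportion of positive pairs in the test data.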

In the CV setting, we train and test each kernel on the same corpus using document-level 10-fold cross-validation. We employ the document-level splits used by Airola et al. and many others (e.g., [23, 31, 32]) to allow for direct comparison of results. The ultimate goal of PPI extraction is the identification of PPIs in biomedical texts with unknown characteristics. This task is better reflected in the CL setting, where training and test sets are drawn from different distributions: we train on an ensemble of four corpora and test on the fifth one. The CL methodology is generally less biased than CV, where the training and the test data sets have very similar corpus characteristics. Note that the difference in the distribution of positive/negative pairs in the five benchmark corpora (ranging from ∼20 to ∼100%) accounts for a substantial part of the diversity of the performance of approaches [8]. The corpora differ not only in this distribution but also in their annotation guidelines and in the definition of what constitutes a PPI. Those differences are largely preserved in the standardized format [8], which was obtained by applying a transformation approach to yield the greatest common factor in annotations.
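
The two evaluation settings can be sketched as follows. This is an illustration only: the experiments use the fixed document-level splits cited above, whereas the sketch assigns documents to folds round-robin. The function names, the dictionary layout with a `doc` key and the synthetic pairs are assumptions.

```python
def document_level_folds(pairs, n_folds=10):
    """Assign whole documents to folds so that no document is split
    between training and test data. Each pair is a dict with a 'doc' key."""
    doc_ids = sorted({pair["doc"] for pair in pairs})
    fold_of_doc = {doc: i % n_folds for i, doc in enumerate(doc_ids)}
    folds = [[] for _ in range(n_folds)]
    for pair in pairs:
        folds[fold_of_doc[pair["doc"]]].append(pair)
    return folds


def cross_learning_splits(corpora):
    """Yield (test_name, train_pairs, test_pairs), training on all other corpora."""
    for test_name in corpora:
        train = [p for name, ps in corpora.items() if name != test_name for p in ps]
        yield test_name, train, corpora[test_name]


# Synthetic stand-ins: four documents with two candidate pairs each per corpus.
corpora = {
    name: [{"doc": f"{name}-d{d}", "pair": k} for d in range(4) for k in range(2)]
    for name in ["AIMed", "BioInfer", "HPRD50", "IEPA", "LLL"]
}
folds = document_level_folds(corpora["AIMed"], n_folds=2)
splits = list(cross_learning_splits(corpora))
```

Splitting by document rather than by pair prevents pairs from the same sentence from appearing on both sides of a split.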

Experimental setup

For the experimental setup we follow the procedure described in [14]. In a nutshell, we applied entity blinding, resolved entity-token mismatch problems and extended the learning format of the sentences with the missing parses. We applied a coarse-grained grid parameter search and selected the best average setting in terms of the averaged F-score measured across the five evaluation corpora as the default setting for each kernel.
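
Entity blinding replaces protein names with placeholders so that the learner cannot memorise specific names. The sketch below shows one possible way to do this; the exact procedure is described in [14] and may differ. The placeholder strings, the function name `blind_entities` and the toy sentence with made-up protein names are assumptions.

```python
def blind_entities(tokens, index_a, index_b, protein_indices):
    """Replace the investigated pair with numbered placeholders and every
    other protein mention with a generic placeholder."""
    blinded = []
    for i, token in enumerate(tokens):
        if i == index_a:
            blinded.append("ENTITY1")
        elif i == index_b:
            blinded.append("ENTITY2")
        elif i in protein_indices:
            blinded.append("ENTITY_OTHER")
        else:
            blinded.append(token)
    return blinded


# Made-up protein names at token positions 0, 2 and 5; the pair is (0, 2).
tokens = ["ProtA", "binds", "ProtB", "but", "not", "ProtC"]
blinded = blind_entities(tokens, 0, 2, protein_indices={0, 2, 5})
```

A sentence with several proteins yields one blinded copy per candidate pair, each with a different pair marked as the investigated one.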

Results and discussion

The main goal of our analysis was to better characterize kernel methods and understand their shortcomings in terms of PPI extraction. We started by characterizing protein pairs: we divided them into three classes based on their difficulty. Difficulty is defined by the observed classification success level of kernels. We also manually scrutinized some of the pairs found to be the most difficult, suspecting that the reason for the failure of the kernels was in fact an incorrect annotation. We re-labeled a set of such suspicious annotations and re-evaluated whether kernels were able to benefit from these modifications. We also compare kernels based on their predictions by defining kernel similarity as prediction agreement on the instance level. We investigate how kernels’ input representations correlate with their similarity. Finally, to quantify the claimed advantage of kernels for PPI extraction, we compare kernels to simpler methods: linear, non-kernel-based classifiers with a surface feature set also found in the kernel methods.

Difficulty of individual protein pairs

In this experiment we determine the difficulty of protein pairs. The fewer kernel-based approaches are able to classify a pair correctly, the more difficult the pair is. Different kernels’ predictions vary heavily, as we have reported in [14]. Here, we show that there exist protein pairs that are inherently difficult to classify (across all 13 kernels), and we investigate whether kernels with generally higher performance classify difficult pairs with greater success.

We define the success level of a pair as the number of kernels that classify it correctly. For CV evaluation we performed experiments with all 13 kernels, so success levels range from 0 to 13. For CL evaluation, we omitted the very slow PT kernel, so success levels range from 0 to 12. Figures 1 and 2 show the distribution of PPI pairs in terms of success level for CV and CL evaluation, respectively, aggregated across the 5 corpora. We also show the same statistics for each corpus separately (Tables 1 and 2). Figure 3 shows the correlation between success levels of CV and CL.
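
The success level and its distribution can be computed as in the following minimal sketch. Only three kernels and three pairs are used, and all labels and predictions are invented for illustration; the function name `success_levels` and the data layout are assumptions.

```python
from collections import Counter


def success_levels(gold, predictions_by_kernel):
    """For each pair, count the kernels whose prediction matches the gold label."""
    return {
        pair_id: sum(
            1 for preds in predictions_by_kernel.values() if preds[pair_id] == label
        )
        for pair_id, label in gold.items()
    }


# Invented gold labels and predictions (True = interacting pair).
gold = {"p1": True, "p2": False, "p3": True}
predictions_by_kernel = {
    "APG": {"p1": True, "p2": False, "p3": False},
    "SL": {"p1": True, "p2": True, "p3": False},
    "kBSPS": {"p1": True, "p2": False, "p3": False},
}
levels = success_levels(gold, predictions_by_kernel)
distribution = Counter(levels.values())  # number of pairs per success level
```

In this toy example, `p3` has success level 0 and would count as a difficult pair, while `p1` is classified correctly by every kernel.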

Table 1 The distribution of pairs for each corpus according to classification success level using cross-validation setting
