Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, 14195 Berlin, Ihnestr. 63-73, Germany

Otto-Warburg Laboratory, Max Planck Institute for Molecular Genetics, 14195 Berlin, Ihnestr. 63-73, Germany

Abstract

Background

Protein-protein interaction networks are key to a systems-level understanding of cellular biology. However, interaction data can contain a considerable fraction of false positives. Several methods have been proposed to assess the confidence of individual interactions. Most of them require the integration of additional data such as protein expression and interaction homology information. Although certainly useful, such additional data are not always available and may introduce further bias and ambiguity.

Results

We propose a novel, network topology based interaction confidence assessment method called CAPPIC (cluster-based assessment of protein-protein interaction confidence). It exploits the network’s inherent modular architecture for assessing the confidence of individual interactions. Our method determines algorithmic parameters intrinsically and does not require any parameter input or reference sets for confidence scoring.

Conclusions

On the basis of five yeast and two human physical interactome maps inferred using different techniques, we show that CAPPIC reliably assesses interaction confidence and its performance compares well to other approaches that are also based on network topology. The confidence score correlates with the agreement in localization and biological process annotations of interacting proteins. Moreover, it corroborates experimental evidence of physical interactions. Our method is not limited to physical interactome maps as we exemplify with a large yeast genetic interaction network. An implementation of CAPPIC is available at

Background

Accurate interaction networks (interactomes) are fundamental to answering questions about how the biochemical machinery of cells organizes matter, processes information, and carries out transformations to perform specific functions leading to various phenotypes. Toward this goal, a number of experimental

Several approaches have been proposed for interaction confidence assessment, many of which are reviewed in

At various levels (globally as well as locally), the topology of interaction networks encodes biological properties which are largely independent of the biochemical function of the individual members of the network

Goldberg and Roth

Here, we propose CAPPIC (cluster-based assessment of protein-protein interaction confidence) – a novel approach that exploits the inherent modular structure of interactomes for confidence assessment of protein-protein interactions. Our method combines the basic principles of the topology-based methods described above: high neighborhood interconnectedness of a pair of proteins and short distance between them (the features exploited by Goldberg and Roth and Kuchaiev

We applied our method to six large-scale interaction networks from yeast to assess its performance and compare it to previous topology-based methods (Table

**Table S1.** Interaction data sets merged to construct the Y2H-human network. The table lists the studies that contribute yeast two-hybrid interactions to the merged Y2H-human network. The file is in XLS format and can be viewed with, e.g., LibreOffice or Microsoft Excel.


**Table S2.** Properties of the Y2H-human and Mazloom networks. The table shows the properties of the human networks used in the analysis (analogous to Table


| network property | Tarassov-all | Tarassov-hq | Yu-Ito-Uetz | Collins | CPDB-yeast | Costanzo |
|---|---|---|---|---|---|---|
| references | | | | | | |
| method | PCA | PCA | Y2H | AP-MS | multiple | genetic |
| node count | 2238 (2293) | 889 (1124) | 1647 (2018) | 1002 (1620) | 6073 (6075) | 4278 (4278) |
| link count | 9360 (9646) | 2407 (2770) | 2518 (2930) | 8313 (9064) | 74332 (74333) | 63927 (63927) |
| clustering coefficient | 0.14 | 0.24 | 0.08 | 0.72 | 0.19 | 0.06 |
| links in triangles | 5861 (62%) | 1761 (73%) | 440 (17%) | 8129 (97%) | 63385 (85%) | 47822 (74%) |
| mean shortest path length | 3.7 | 5.6 | 5.6 | 5.5 | 2.7 | 2.9 |
| links with ≥ 3 publications | 546 (5%) | 419 (17%) | 598 (23%) | 1635 (19%) | 6324 (8%) | 2546 (3%) |

An implementation of CAPPIC is available as a web-based tool called IntScore at

Results

Approach

Assessing protein interaction confidence by random walk interaction clustering

Interaction data are usually modeled as graphs where nodes represent proteins or genes and edges represent interactions between them. For assessing the confidence of every interaction in a network, we apply the following strategy (illustrated in Figure

The fidelity $F_{p,c}$ of a protein $p$ to a cluster $c$ is computed from four counts: $N_{p,c}$, the number of interactions of protein $p$ in cluster $c$; $N_{p,\cdot}$, the total number of interactions of $p$; $N_{\cdot,c}$, the total number of interactions in $c$; and $N_{\cdot,\cdot}$, the total number of interactions in the network:


**Outline of our interaction confidence assessment method.** In the input interaction network (upper left picture), proteins are labeled with letters (A, B, etc.) and interactions between them are represented by edges. In the first step of the approach, we create the line graph of the given network where nodes represent interactions (labeled A–C, A–D, etc.) and edges represent shared interaction participants. In the second step, we use Markov clustering on this line graph to dissect it into interaction clusters. The clustering granularity is optimized in a previous step of the algorithm. Importantly, proteins can be part of more than one cluster. The relative number of interactions of a protein in a cluster determines how specific a protein is to that cluster. In the third step, we calculate confidence values for every interaction based on how specific both proteins are to the respective clusters. The thickness of interaction links in the lower left picture corresponds to the calculated interaction confidence values for this example network.
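The first step of the outline, building the line graph, can be sketched in a few lines of Python. This is only an illustration under simple assumptions (an undirected network given as a list of protein pairs); the function name `line_graph` and the toy network are hypothetical and not part of the CAPPIC implementation.

```python
from collections import defaultdict
from itertools import combinations


def line_graph(interactions):
    """Return the line graph of an undirected interaction network.

    Each node of the line graph is one interaction (a sorted protein pair);
    two such nodes are linked when the interactions share a protein.
    """
    # Normalise every interaction to a sorted tuple so A-B and B-A coincide.
    edges = sorted({tuple(sorted(pair)) for pair in interactions})

    # Index interactions by the proteins that take part in them.
    by_protein = defaultdict(list)
    for edge in edges:
        for protein in set(edge):  # set() avoids counting self-interactions twice
            by_protein[protein].append(edge)

    # Interactions incident to the same protein become neighbours.
    adjacency = {edge: set() for edge in edges}
    for incident in by_protein.values():
        for e1, e2 in combinations(incident, 2):
            adjacency[e1].add(e2)
            adjacency[e2].add(e1)
    return adjacency


# Hypothetical toy network with four interactions.
network = [("A", "C"), ("A", "D"), ("C", "D"), ("D", "E")]
lg = line_graph(network)
```

In this toy network the interaction D–E is adjacent to A–D and C–D in the line graph because all three share protein D. Clustering the nodes of `lg` then groups interactions rather than proteins, which is what allows a protein to belong to more than one cluster.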

The value of the fidelity $F_{p,c}$ lies between 0 and 1, with values near or equal to 1 if a protein $p$ is highly specific to cluster $c$. For a fixed number $N_{p,c}$ of interactions of $p$ in $c$, it holds that the smaller the cluster (smaller $N_{\cdot,c}$, the total number of interactions in $c$), the greater the fidelity value. Finally, if all the links of two proteins lie within a cluster, the fidelity is greater for the protein with the higher degree.

We define interaction confidence as the product of the fidelity values of both interacting proteins to the cluster that contains the interaction.

Interactions get high confidence values if both proteins are specific to the cluster containing the interaction, and low confidence values when one or both of the proteins are not specific to the cluster.
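The scoring step can be illustrated with a short Python sketch. It assumes that every interaction has already been assigned to exactly one cluster and that there are no self-interactions. The fidelity function is passed in as a parameter; the default used here (the share of a protein's interactions that fall into the cluster) is only a simplified stand-in chosen for illustration and is not the fidelity formula of the method. All names are hypothetical.

```python
from collections import Counter


def interaction_confidence(cluster_of, fidelity=None):
    """Score every interaction as the product of both proteins' fidelities.

    cluster_of maps an interaction (protein_a, protein_b) to its cluster label.
    fidelity(n_pc, n_p, n_c, n_total) must return a value in [0, 1]. The
    default is a placeholder (share of a protein's links inside the cluster),
    NOT the published fidelity formula.
    """
    if fidelity is None:
        def fidelity(n_pc, n_p, n_c, n_total):
            return n_pc / n_p

    n_pc = Counter()  # interactions of protein p inside cluster c
    n_p = Counter()   # all interactions of protein p
    n_c = Counter()   # all interactions in cluster c
    for (a, b), c in cluster_of.items():
        n_c[c] += 1
        for p in (a, b):
            n_pc[p, c] += 1
            n_p[p] += 1
    n_total = len(cluster_of)  # all interactions in the network

    scores = {}
    for (a, b), c in cluster_of.items():
        f_a = fidelity(n_pc[a, c], n_p[a], n_c[c], n_total)
        f_b = fidelity(n_pc[b, c], n_p[b], n_c[c], n_total)
        scores[a, b] = f_a * f_b
    return scores


# Hypothetical clustering of a toy network into two interaction clusters.
clusters = {("A", "C"): 1, ("A", "D"): 1, ("C", "D"): 1, ("D", "E"): 2}
scores = interaction_confidence(clusters)
```

With the placeholder fidelity, A–C receives the highest score because both proteins have all their links in cluster 1, whereas D–E scores lowest because most links of D lie in the other cluster. This mirrors the qualitative behaviour described above.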

Optimal clustering granularity is reliably determined through partial network rewiring

The interaction confidence scores calculated by CAPPIC are dependent on the granularity of the interaction clustering. It has been previously shown that modules in many complex networks, including protein interaction maps, are organized in a hierarchical manner

Experiments showed that randomly rewiring 3% of the links in the granularity estimation procedure described above is a good choice, because it yields a false interaction set of reasonable size while keeping most of the network intact. If the set of false interactions obtained through random rewiring is too small, the granularity estimation lacks statistical power; if too many interactions are rewired, the network’s original modular structure is altered, which affects the granularity estimate. For all networks to which CAPPIC was applied, random rewiring of 1%, 3%, 5%, or 10% of the interactions yielded very similar optimal granularity estimates.
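A minimal sketch of partial random rewiring is given below for illustration. It assumes that rewiring means removing a random fraction of links and replacing each by a random protein pair that is not linked in the original network; the exact procedure of the implementation may differ. The function name `rewire`, its parameters, and the ring-shaped toy network are hypothetical.

```python
import random


def rewire(interactions, fraction=0.03, seed=0):
    """Replace a random fraction of links by random links not in the network.

    Returns (rewired_network, false_links), where false_links are the newly
    introduced pairs that serve as a known false-interaction set.
    """
    rng = random.Random(seed)
    edges = {frozenset(pair) for pair in interactions}
    nodes = sorted({p for e in edges for p in e})
    n_rewire = max(1, round(fraction * len(edges)))

    # Guard against networks too dense to host the requested false links.
    free_pairs = len(nodes) * (len(nodes) - 1) // 2 - len(edges)
    if n_rewire > free_pairs:
        raise ValueError("not enough unconnected protein pairs to rewire")

    # Sort before sampling so the result is reproducible for a given seed.
    removed = set(rng.sample(sorted(edges, key=sorted), n_rewire))
    kept = edges - removed

    false_links = set()
    while len(false_links) < n_rewire:
        candidate = frozenset(rng.sample(nodes, 2))
        # Accept only pairs that were never reported as interacting.
        if candidate not in edges:
            false_links.add(candidate)
    return kept | false_links, false_links


# Hypothetical toy network: a ring of 100 proteins, 3% of links rewired.
ring = [(i, (i + 1) % 100) for i in range(100)]
rewired, false_links = rewire(ring, fraction=0.03, seed=1)
```

The rewired network keeps the original number of links, and the returned `false_links` can be used to check how well a given clustering granularity separates introduced links from the remaining ones.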

Our granularity estimation strategy builds upon the assumption that the optimal granularity value inferred from a partially rewired network instance (where both false positive and false negative rates are increased compared to the real network) is transferable to the real network. To scrutinize this reasoning, we verified for all reference networks that 1) the estimated optimal granularity was largely independent of the random choice of links for rewiring; and 2) interaction clusters were similar for the intact and the partially rewired networks clustered with the same inflation value (see Additional file

**Supplementary Text.** This file contains additional text and figures demonstrating the validity of the partial random rewiring approach for clustering parameter optimization, as well as text and figures showing that CAPPIC scores can be used for interaction cluster de-noising. The file is in PDF format and is viewable e.g. with Adobe Reader.


True positive interactions are assigned higher confidence than false positives

We measured the performance of CAPPIC and compared it to previously proposed network topology based interaction confidence assessment methods using five yeast physical interaction networks and one genetic interactome map, covering major interaction inference methods (Table


**ROC analysis measuring the performance of CAPPIC in comparison to the methods by Goldberg and Roth and Kuchaiev et al.** False positive rate (1-specificity) is plotted against true positive rate (sensitivity) for each of the six reference networks. Since the definition of a negative interaction set in the performance assessment involves a random process, the ROC plots summarize the outcome of 100 runs. Plots show the average ROC curves (thick lines), their standard error bands (dotted lines), as well as the mean area under the ROC curve (AUC) of all runs. The ‘X’-marks on the green ROC curves correspond to the fraction of true/false interactions whose proteins share network neighbors and are thus scored by Goldberg and Roth’s method.
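For readers who wish to reproduce the AUC values, the following sketch computes the area under the ROC curve from the confidence scores of a positive and a negative interaction set. It uses the standard rank-based identity (the AUC equals the probability that a random positive scores higher than a random negative) rather than the evaluation code used for the figure; the function name and the example scores are hypothetical.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    Equals the probability that a randomly drawn positive interaction scores
    higher than a randomly drawn negative one; ties count one half.
    """
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical confidence scores for three true and two false interactions.
auc = roc_auc([0.9, 0.8, 0.4], [0.5, 0.3])
```

Here five of the six positive-negative pairs are ordered correctly, so `auc` is 5/6. Averaging this value over repeated random draws of the negative set gives a mean AUC of the kind reported in the plots.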

In the case of well-studied organisms such as yeast, data on protein complexes can be used to define the positive interaction sets as an alternative to the literature evidence used above. We used two complex-based positive sets from yeast complexes obtained from CYC2008

**Figure S1.** ROC plots with complex-based positive reference sets. Receiver operating characteristic analysis results for the yeast reference networks where complex-based positive reference sets have been used. Complexes were obtained from ref.


Cluster based confidence scores corroborate experimental interaction evidence

To compare confidence values calculated by CAPPIC with experiment-based interaction scores, we exploited the fact that some of the interactions in Tarassov-all have been designated high-quality by the authors based on experimental interaction intensity
^{−10}). The high agreement between cluster based interaction confidence scores and experimental interaction weight for the Tarassov-all network was corroborated by a significant Spearman rank correlation between both (^{−5}).


**Histogram of confidence scores for interactions in Tarassov-all calculated by our method.** The normalized histograms of interaction confidence scores are shown for the complete Tarassov-all network, as well as for its high-quality (Tarassov-hq) and non-high-quality parts. WRST: Wilcoxon rank sum test of the difference between confidence score distributions of both network parts. Note that the Y-axis is interrupted to better show the differences between the three data sets.

High-confidence interactions are more consistent in biological process and cellular compartment annotation

Interacting proteins are expected to participate in related biological processes and to be co-localized in compartments of the cell


**Correlation of CAPPIC interaction confidence with semantic similarity of Gene Ontology co-annotations.** Interactions from every network are ranked by confidence and divided into five equal sized bins (X-axis); for each bin, the average semantic similarity of GO biological process (blue) and cellular component (green) annotations of interacting proteins is shown (Y-axis). Additionally, the pale continuous lines correspond to the mean GO semantic similarity over the complete network rather than the separate bins. The dashed lines reflect the average GO semantic similarity of random pairs of proteins from the network.
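The binning used in this figure can be sketched as follows. The example assumes that confidence scores and GO semantic similarities are available as dictionaries keyed by interaction and that there are at least as many interactions as bins; the function name and the toy values are hypothetical.

```python
def mean_similarity_per_bin(confidence, similarity, n_bins=5):
    """Rank interactions by confidence, split them into equal-sized bins, and
    return the mean similarity of each bin (lowest-confidence bin first)."""
    ranked = sorted(confidence, key=confidence.get)
    size, extra = divmod(len(ranked), n_bins)
    means, start = [], 0
    for i in range(n_bins):
        # Spread any remainder over the first bins so sizes differ by at most 1.
        end = start + size + (1 if i < extra else 0)
        members = ranked[start:end]
        means.append(sum(similarity[m] for m in members) / len(members))
        start = end
    return means


# Hypothetical toy data: ten interactions whose similarity grows with confidence.
confidence = {i: i / 10 for i in range(10)}
similarity = {i: float(i) for i in range(10)}
bin_means = mean_similarity_per_bin(confidence, similarity)
```

For the toy data, `bin_means` increases from the lowest to the highest confidence bin, which is the pattern the figure shows for the reference networks.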

Furthermore, if low-confidence interactions are removed from interaction clusters, the latter become more consistent regarding the pathway annotations of the contained proteins (see Additional file

The performance of CAPPIC is consistent between yeast and human networks

To exemplify that the performance of CAPPIC is consistent for different taxonomic species, we also applied it to two human networks: Y2H-human (Additional file
^{−5}). The correlation is negative since interactions with smaller ranks tend to get higher CAPPIC scores. As in the case of the yeast Tarassov-all network described above (that has been obtained by protein-fragment complementation assay), CAPPIC corroborates independent interaction evidence also for this human immuno-precipitation based network.


**Performance of CAPPIC on human networks.** **A)** and **C)**: ROC plots for Y2H-human and Mazloom, respectively (for details, see Figure
**B)** and **D)**: correlation of CAPPIC scores with GO semantic similarity for Y2H-human and Mazloom, respectively (for details, see Figure

Discussion

Network topology-based approaches are motivated by the fact that the structure of interaction networks is not random but reflects biological functionality

CAPPIC compares well to previous topology-based approaches by Goldberg and Roth and Kuchaiev

**Figure S2.** Cluster number and sizes for the yeast reference networks clustered with the optimal granularity. Yeast reference networks were clustered at the optimal inflation value into 10-50 interaction clusters. Here, the cluster sizes in terms of number of interactions (blue line, left-hand-side Y-axis) and number of genes/proteins (green line, right-hand-side Y-axis) per cluster are plotted.


CAPPIC should be applicable for weighting any binary network with an inherent modular structure (for examples, see

Unlike the reference methods, CAPPIC is able to accommodate experimental evidence weights of interactions. Interaction detection techniques often associate such weights with predicted interactions, reflecting for example the number of times an interaction is observed in repetitions of a yeast-two-hybrid experiment

**Figure S3.** Distribution of CAPPIC scores for the hubs PHO85 and UBC7 in comparison to the whole data set.


Our approach can be combined with other lines of interaction evidence like other topological features, protein co-expression, or interaction homology to achieve even better scoring performance

Conclusions

Since biological interaction networks contain false positives, assessing the confidence of individual interactions in order to weight or filter interaction data is a crucial step that should precede network-based inferences. Here we propose a network topology based method called CAPPIC that estimates interaction confidence by exploiting the network’s inherent modularity. CAPPIC requires no reference interaction sets or parameter settings. Based on five large-scale physical interaction networks from yeast, we show that our method compares well to other topology-based approaches. Confidence scores calculated with CAPPIC also correlate well with the Gene Ontology co-annotation of interacting proteins, and corroborate experimental evidence of physical interactions. CAPPIC is limited neither to physical interactome maps nor to yeast networks as it also performs well on a large yeast genetic interaction network and on two human protein-protein interaction data sets.

Methods

Application of Markov clustering algorithm

To cluster a network of interactions, we use the original implementation of the Markov clustering algorithm (version 10-201 downloaded from

Receiver operating characteristic analysis

To conduct ROC analysis, we constructed true and false interaction sets. The positive set comprised interactions published in at least three papers in total. An exception was made for the Costanzo network because of the scarcity of genetic interaction data: the positive set in this case consisted of interactions that are also reported in

Application of reference methods

We set the number of yeast genes to 6,000 in the method by Goldberg and Roth. The parameters of the method by Kuchaiev

Assessing semantic similarity of Gene Ontology annotations

For each network, we obtained the GO semantic similarity of biological process and cellular component annotations of interacting proteins using the method proposed by Resnik

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

AK and US conceived the method. AG and RH provided feedback on the method and contributed ideas. AK developed the method and carried out the experiments. AK and US wrote the manuscript; AG and RH provided feedback on the manuscript. All authors read and approved the manuscript.

Acknowledgements

This work was funded by the European Commission under its 7FP grant diXa (grant number 283775) and the German Ministry for Education and Research under its grants MedSys - PREDICT (grant number 0315428A) and NGFNplus (NeuroNet TP3, grant number 01GS08171).