School of Computer Science and Technology, Xidian University, 710071, PR China

Abstract

Background

Studying protein complexes is important for understanding biological processes because it helps reveal structure-function relationships in biological networks, and much attention has been paid to accurately predicting protein complexes from the increasing amount of protein-protein interaction (PPI) data. Most available algorithms assume that dense subgraphs correspond to complexes and fail to take into account the inherent organization within protein complexes and the roles of edges. Thus, there is a critical need to investigate the possibility of discovering protein complexes using the topological information hidden in edges.

Results

To investigate the roles of edges in PPI networks, we show that the edges connecting topologically less similar vertices are more significant in maintaining global connectivity, which indicates a weak ties phenomenon in PPI networks. We further demonstrate that there is a negative relation between the weak tie strength and the topological similarity. Using the bridges, a reliable virtual network is constructed, in which each maximal clique corresponds to the core of a complex. In this way, the detection of protein complexes is transformed into the classic all-clique problem. A novel core-attachment based method is developed, which detects the cores and the attachments separately. The complexes predicted by our algorithm and by existing algorithms are comprehensively compared against benchmark complexes.

Conclusions

We showed that the weak tie effect exists in the PPI network and demonstrated that density alone is insufficient to characterize the topological structure of protein complexes. Furthermore, experimental results on the yeast PPI network show that the proposed method outperforms state-of-the-art algorithms. Analysis of the modules detected by the present algorithm suggests that most of them are biologically significant in the context of complexes, indicating that the roles of edges are critical in discovering protein complexes.

Background

Interpretation of the completed biological genome sequences initiated a decade of landmark studies addressing the critical aspects of cell biology on a system-wide level, including gene expression analysis

Protein complexes, molecular aggregations of proteins assembled through multiple protein interactions, are among the fundamental units of macro-molecular organization and play crucial roles in integrating individual gene products to perform useful cellular functions. For example, the complex 'RNA polymerase II' transcribes genetic information into messages for ribosomes to produce proteins. Unfortunately, the mechanism behind most biological activities is still unknown, and hence accurately predicting protein complexes from the available PPI data is of considerable practical value because it allows us to infer the principles of biological processes.

Methods for protein complex prediction are either experimental or computational. Experimentally, the Tandem Affinity Purification (TAP) with mass spectrometry

Generally, protein interaction data can be effectively modeled as a graph (also called a network) by regarding each protein as a vertex and each known interaction between two proteins as an edge. Although there are plenty of related results in graph theory and many graph algorithms have been developed, it is still non-trivial to design an efficient algorithm to mine protein complexes from PPI networks. One reason is that there has not been an exact definition for a protein complex. To overcome this difficulty, Tong

Although it is non-trivial to design effective and efficient computational methods for predicting complexes, many algorithms have been devoted to the issue. Markov Cluster Algorithm (MCL)

Apart from the biological information, some newly developed algorithms using the core-attachment structure in complexes revealed by Gavin

**A schematic example of the core-attachment structure of protein complexes**. An example of the DNA repair complex

The core-attachment based approaches dramatically outperform the other available state-of-the-art algorithms, which demonstrates the significance of this structure and its critical role in discovering protein complexes. This is one of our major motivations. Another major problem confounding the existing computational algorithms is that the available PPI networks are too sparse; for instance, the average numbers of interactions per protein are 5.29, 6.98, and 10.62 in DIP

**Question: **

In this study, we aim to investigate the possibility of extracting protein complexes by exploring the roles of edges, and to give an affirmative answer to the above question. In detail, similar to the weak ties effects in mobile communication

Materials and methods

Our approach consists of three main steps: (1) verifying the existence of the weak ties effect in PPI networks; (2) constructing a reliable network by exploring the roles of edges; and (3) identifying the protein complexes by using a core-attachment based method. We describe them in turn.

Weak ties phenomenon in PPI networks

A network consists of two basic elements: vertices and edges. Many measurements are developed to characterize the role of a node for structure and function including random walk-based indices

Actually, edges in a network usually have two roles to play: some contribute to the global connectivity like the ones connecting two clusters while others enhance the locality like the ones inside a cluster. In social networks, the two roles are reflected as two important phenomena, being respectively the homophily

To investigate the weak ties effect in PPI networks, we quantify how the topological structure changes during an edge percolation process. In detail, if the weak ties effect exists in terms of topological similarity, the network disintegrates faster when edges are deleted successively in ascending order of similarity than when they are deleted in descending order. Similar to _{GC}

where

Prior to studying the weak ties, the bridgeness of an edge should be discussed. In

where (_{u}_{(u,υ) }is the size of the maximal clique containing (

Actually, if (

where _{u\υ}

Similar to Ref. _{GC}_{GC}_{GC}

**Edge percolation results on PPI networks**. Plots (a) and (b) are for the topological similarity, while (c-d) and (e-f) are for bridgeness. In (a) and (b), the min- (max-) lines represent the processes where the edges are removed from the least (most) similar to the most (least) similar ones. In (c/e) and (d/f), the min- (max-) lines denote the processes where the edges with smaller (larger) bridgeness based on Eq.(3)/Eq.(2) are removed first.
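The edge percolation experiment can be illustrated with a minimal sketch. It is not the authors' MATLAB code: the function names are hypothetical, the giant-component fraction is used as the only order parameter, and a simple neighborhood overlap (Jaccard index) stands in for the topological similarity, whose exact definition is not reproduced here.

```python
from collections import defaultdict


def giant_component_fraction(n, edges):
    """Fraction of the n vertices that lie in the largest connected component."""
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru
            size[ru] += size[rv]
    return max(size[find(x)] for x in range(n)) / n


def neighborhood_overlap(edges):
    """Jaccard overlap of the endpoints' neighborhoods (an assumed stand-in
    for the topological similarity used in the paper)."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    score = {}
    for u, v in edges:
        a, b = nbrs[u] - {v}, nbrs[v] - {u}
        union = a | b
        score[(u, v)] = len(a & b) / len(union) if union else 0.0
    return score


def percolation_curve(n, edges, score, ascending=True):
    """Giant-component fraction after removing 0, 1, ..., len(edges) edges,
    taken in ascending (or descending) order of their score."""
    ordered = sorted(edges, key=lambda e: score[e], reverse=not ascending)
    return [giant_component_fraction(n, ordered[k:])
            for k in range(len(ordered) + 1)]
```

On a toy graph made of two triangles joined by a single edge, removing edges in ascending order of overlap cuts the joining edge first and splits the network immediately, whereas the descending order keeps it connected for longer. This is the qualitative signature of the weak ties effect described in the text.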

Furthermore, the relation between the topological similarity and bridgeness is also studied. The topological similarity for protein pair is defined as

where ^{k}_{ij }denotes the number of walks of length _{i}_{j}, and

**Relation between bridgeness and topological similarity**. <

Constructing a reliable network

Gavin

To assess the topological proximity of a core, a measure of proximity between a pair of vertices is needed first. The most commonly used one is the graph distance, that is, the length of the shortest path connecting the pair of vertices. This quantity, however, is not appropriate for biological networks, largely because of two drawbacks: first, it does not take into account the local structural features of the network; second, it is very susceptible to noise, e.g., a single missing edge can affect the proximity significantly. Thus, vertices connected by paths of various lengths are likely to be functionally closer than vertices connected via a single path. In detail, given a walk of length $k$, say $v_1 \to v_2 \to \cdots \to v_{k+1}$, its strength is defined as the product of the weights on each edge in the walk, i.e., $\prod_{i=1}^{k} w(v_i, v_{i+1})$.
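As a small illustration of the walk strength, the sketch below computes the strength of a single walk and the sum of strengths over all walks of a given length between two vertices. The latter equals the corresponding entry of the k-th power of the weight matrix, which is a standard fact. The function names and the toy weights are hypothetical and chosen only for illustration.

```python
def walk_strength(w, walk):
    """Product of the edge weights along a walk given as a list of vertices."""
    strength = 1.0
    for a, b in zip(walk, walk[1:]):
        strength *= w[a][b]
    return strength


def mat_mul(a, b):
    """Plain-Python product of two square matrices of equal size."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def walk_strength_sums(w, k):
    """Entry [u][v] is the summed strength of all walks of length k from u to v,
    obtained as the k-th power of the weight matrix w (k >= 1)."""
    result = w
    for _ in range(k - 1):
        result = mat_mul(result, w)
    return result


# Toy symmetric weight matrix on three proteins (illustrative values only).
W = [[0.0, 0.5, 1.0],
     [0.5, 0.0, 0.5],
     [1.0, 0.5, 0.0]]
```

With these toy weights, the only walk of length 2 from vertex 0 to vertex 2 with non-zero strength is 0 → 1 → 2, so `walk_strength(W, [0, 1, 2])` and `walk_strength_sums(W, 2)[0][2]` agree.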

Given an un-weighted PPI network, how to assign weights to edges is one of the key steps in our algorithm. As shown in Figure

The larger the bridgeness of an interaction, the smaller its weight.

Now, it is sufficient to deal with the similarity between a pair of proteins via various lengths of walks. (^{k}_{uυ }denotes the sum of strengths of all walks of length

where _{ij }=

For any protein pair, if the similarity between the two proteins is large enough, we have good reason to believe they should be connected; otherwise, they should be unconnected. Therefore, the proteins within a core should all be connected to each other. To construct a virtual and reliable network for the original PPI network, similar to

**Definition 1 **_{τ}, E_{τ}_{τ}=_{τ }= {(_{u,υ},τ

There are two good physical interpretations for Φ(

In this way, the core of a protein complex corresponds to a maximal clique in the virtual network. In what follows, we design an algorithm that discovers complexes by extracting the cores and the attachments separately.
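The idea of a thresholded virtual network whose maximal cliques act as candidate cores can be sketched as follows. This is an illustration under stated assumptions: an edge is placed between two proteins when their similarity is at least τ, and the maximal cliques are listed with the generic Bron-Kerbosch procedure with pivoting, which is swapped in here for the core mining algorithm used in the paper. All function names are hypothetical.

```python
def virtual_network(sim, tau):
    """Adjacency sets of the graph that links u and v when sim[u][v] >= tau."""
    n = len(sim)
    adj = {u: set() for u in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if sim[u][v] >= tau:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def maximal_cliques(adj):
    """All maximal cliques of the graph, via Bron-Kerbosch with pivoting."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already processed vertices.
        if not p and not x:
            cliques.append(sorted(r))
            return
        # Pivot on the vertex with the most neighbors among the candidates.
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Toy similarity matrix: proteins 0, 1, 2 are mutually similar, 3 is close to 2.
S = [[0.0, 0.9, 0.9, 0.1],
     [0.9, 0.0, 0.9, 0.1],
     [0.9, 0.9, 0.0, 0.8],
     [0.1, 0.1, 0.8, 0.0]]
cores = maximal_cliques(virtual_network(S, tau=0.5))
```

Because listing all maximal cliques takes exponential time in the worst case, this sketch is meant for small examples only. Isolated vertices are reported as cliques of size one and would normally be filtered out before being treated as cores.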

A core-attachment algorithm

The first task is to extract all the maximal cliques in the virtual network, known as the classic all-cliques problem, an NP-hard problem

We would like to point out that, although we adopt the same strategy to detect the cores, our algorithm differs greatly from the Coach algorithm for two reasons: first, our algorithm detects cores in a virtual network built on the weak ties phenomenon, while Coach works on the original network; second, the strategies for detecting the attachments differ greatly.

Given a core denoted by an induced subgraph _{U}, there should be no protein

which quantifies the average closeness of

The procedure can be described as follows:

Step 1: Compute the bridgeness for each interaction in PPI network

Step 2: Compute similarity matrix

Step 3: Construct the virtual network Φ(

Step 4: Extract the cores using Protein-complex core mining algorithm

Step 5: Detect the attachments for each core.
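Step 5 can be sketched under a simplifying assumption, since the exact attachment criterion is not reproduced in this text. Here the closeness of a protein to a core is taken to be its average similarity to the core members, and the protein is attached when this average reaches a threshold `beta`. Both the criterion and the name `detect_attachments` are hypothetical and stand in for the rule used in the paper.

```python
def detect_attachments(core, sim, beta):
    """Proteins outside the core whose average similarity to the core members
    is at least beta (an assumed attachment rule, for illustration only)."""
    core_set = set(core)
    attachments = []
    for v in range(len(sim)):
        if v in core_set:
            continue
        closeness = sum(sim[v][u] for u in core) / len(core)
        if closeness >= beta:
            attachments.append(v)
    return attachments


# Toy similarity matrix for five proteins; the core is {0, 1, 2}.
SIM = [[0.0, 0.9, 0.9, 0.2, 0.0],
       [0.9, 0.0, 0.9, 0.2, 0.0],
       [0.9, 0.9, 0.0, 0.8, 0.3],
       [0.2, 0.2, 0.8, 0.0, 0.1],
       [0.0, 0.0, 0.3, 0.1, 0.0]]
complex_members = [0, 1, 2] + detect_attachments([0, 1, 2], SIM, beta=0.3)
```

In this toy example protein 3 has an average similarity of about 0.4 to the core and is attached, while protein 4 (about 0.1) is not.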

Performance measures

The biological significance of the numerically computed modules can be validated by comparing them against the experimentally determined complexes (introduced in the Results section).

F-measure

Let _{cb}_{cb }= │{_{cp}_{cp }= │{

where
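A common way to compute this kind of F-measure is sketched below. The matching rule is an assumption, because the formula is not reproduced in this text: a predicted complex p and a benchmark complex b are considered matched when the neighborhood affinity |p ∩ b|² / (|p| · |b|) reaches a threshold `omega`. The default value 0.2 is a frequently used choice and is not taken from this paper.

```python
def neighborhood_affinity(p, b):
    """Overlap score |p & b|^2 / (|p| * |b|) between two protein sets."""
    p, b = set(p), set(b)
    return len(p & b) ** 2 / (len(p) * len(b))


def f_measure(predicted, benchmark, omega=0.2):
    """Harmonic mean of precision and recall under the affinity matching rule."""
    # Predicted complexes that match at least one benchmark complex.
    n_cp = sum(any(neighborhood_affinity(p, b) >= omega for b in benchmark)
               for p in predicted)
    # Benchmark complexes that match at least one predicted complex.
    n_cb = sum(any(neighborhood_affinity(p, b) >= omega for p in predicted)
               for b in benchmark)
    precision = n_cp / len(predicted)
    recall = n_cb / len(benchmark)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


predicted = [{"a", "b", "c"}, {"x", "y"}]
benchmark = [{"a", "b", "c", "d"}]
score = f_measure(predicted, benchmark)
```

In the toy data one of the two predicted complexes matches the single benchmark complex, giving a precision of 0.5 and a recall of 1.0.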

Coverage rate

The coverage rate assesses how many proteins in the real complexes can be covered by the predicted complexes _{ij}

where _{i}
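The coverage rate can be sketched as below, assuming the commonly used definition: for each real complex, take the largest number of proteins it shares with any single predicted complex, sum these maxima, and divide by the total number of proteins over all real complexes. The formula in this text is incomplete, so this definition and the function name are assumptions.

```python
def coverage_rate(real, predicted):
    """Sum over real complexes of the best overlap with any predicted complex,
    divided by the summed sizes of the real complexes."""
    covered = sum(max(len(set(r) & set(p)) for p in predicted) for r in real)
    total = sum(len(set(r)) for r in real)
    return covered / total


real = [{"a", "b", "c", "d"}, {"e", "f"}]
predicted = [{"a", "b", "x"}, {"c", "d", "e"}]
cr = coverage_rate(real, predicted)
```

In this toy example the first real complex shares at most two proteins with a predicted complex and the second shares one, so three of the six proteins are covered.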

P-value

The

where │
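Functional enrichment of this kind is usually assessed with a hypergeometric tail probability, and the sketch below assumes that form. Here `n_total` is the number of proteins in the network, `n_func` the number of proteins annotated with a given function, `n_cluster` the size of the predicted complex, and `k` the number of its members carrying the annotation. These parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(n_total, n_func, n_cluster, k):
    """Probability of seeing at least k annotated proteins in a random set of
    n_cluster proteins drawn without replacement from n_total proteins."""
    upper = min(n_func, n_cluster)
    favorable = sum(comb(n_func, i) * comb(n_total - n_func, n_cluster - i)
                    for i in range(k, upper + 1))
    return favorable / comb(n_total, n_cluster)


# Toy numbers: 10 proteins, 4 annotated, a cluster of 3 with 2 annotated.
p = enrichment_p_value(10, 4, 3, 2)
```

Summing the upper tail directly avoids the loss of precision that can occur when a very small P-value is obtained by subtracting a number close to one from one.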

Geometric accuracy

To measure the robustness of the algorithm, the following measures are adopted

where _{i}

Based on

Geometrical separation

Before our description about the geometrical separation, we define

where

where

Results

In this section, the presented algorithm is applied to PPI networks to verify its performance from two perspectives: its ability to predict protein complexes accurately, and its robustness. The algorithm was coded in MATLAB version 7.11.

Data

The Database of Interacting Proteins

F-measure and coverage rate

To further verify the novel bridgeness, we consider two versions of our algorithm: Type I uses the bridgeness in Eq.(2), and Type II uses the bridgeness in Eq.(3). The basic information of the predictions by the compared algorithms is summarized in Table

The results of various algorithms using DIP data

| | **MCL** | **DPClus** | **DECAFF** | **Coach** | **Our method-I** | **Our method-II** |
|---|---|---|---|---|---|---|
| Predicted complexes | 1116 | 1143 | 2190 | 746 | 686 | 620 |
| Covered proteins | 4930 | 2987 | 1832 | 1832 | 1776 | 1702 |
| N_{cp} | 193 | 193 | 605 | 285 | 242 | 230 |
| N_{cb} | 242 | 274 | 243 | 249 | 198 | 220 |

Figure

**F-measure and Coverage rate**. The performance comparison for various algorithms on DIP data.

P-value

To further investigate the biological significance of the predicted complexes, the

We discarded all clusters with ^{-2 }for each protein complex because it offers a compromise between complex-cluster matching rate and a clustering passing rate.

Table

Statistical significance of protein complexes obtained by various algorithms on DIP data

| | **MCL** | **DPClus** | **DECAFF** | **Coach** | **Our method-I** | **Our method-II** |
|---|---|---|---|---|---|---|
| Predicted complexes | 1116 | 1143 | 2190 | 746 | 686 | 620 |
| Significant complexes | 312 | 352 | 1653 | 622 | 536 | 519 |
| Proportion (%) | 34.2 | 30.8 | 75.5 | 83.4 | 78.1 | 83.7 |

Selected complexes predicted by our method-II on DIP data

| **ID** | **Match** | **P-value** | **Predicted complexes** | **Function** |
|---|---|---|---|---|
| 1 | 90.5% | 5.44E-44 | YBL002W YBR009C YBR154C YDL140C YDL150W YGL070C YJR063W YKL144C YKR025W YNL113W YNR003C YOR116C YOR151C YOR207C YOR210W YOR224C YOR341W YPR010C YPR110C YPR187W YPR190C | DNA-directed RNA polymerase activity |
| 2 | 94.4% | 8.77E-40 | YDL150W YKL144C YKR025W YNL151C YNR003C YOR116C YOR207C YPR110C YBL002W YBR154C YDR045C YJR063W YNL113W YOR224C YOR341W YPR010C YPR187W YPR190C | RNA polymerase activity |
| 3 | 100% | 7.57E-26 | YPL138C YDR469W YBR175W YHR119W YBR258C YAR003W YKL018W YLR015W | histone methyltransferase activity (H3-K4 specific) |
| 4 | 88.2% | 1.49E-20 | YBL093C YBR253W YDR443C YNL025C YNL236W YOR140W YBR193C YCR081W YDL005C YER022W YGL151W YGR104C YHR041C YOL051W YOL135C YPL042C YPL248C | transcription regulator activity |
| 5 | 100% | 2.64E-21 | Q0085 YBL099W YDR298C YDR377W YJR121W YKL016C YML081C-A YPL078C YPR020W | proton-transporting ATPase activity, rotational mechanism |

Size and density distributions

Because the above experiments sufficiently demonstrate the superiority of the proposed bridgeness, we focus only on the Type II method in the following experiments.

The

**Size distribution of predicted complexes**. Protein complex size distributions of the various methods and the benchmark set: (A) the benchmark set; (B) Coach; (C) our algorithm; (D) DPClus; (E) MCL.

Notice that our algorithm is quite different from those based on discovering the dense subgraphs because it makes use of the weak ties effect. To verify the difference on the densities of the predicted complexes, we compared the Coach algorithm with our method in terms of the graph densities of the predicted complexes, shown in the Figure

**Density distribution of predicted complexes**. Comparison of the densities of the protein complexes predicted by the various algorithms.

Effects of the parameters

This subsection is devoted to investigate how the parameters ^{4 }if

**Effect of parameter τ**. The plot of the number of edges in the virtual network for various values of

The parameter

**Effect of parameter β**. The plot of the F-measure and Coverage rate for different values of

Robustness analysis

This subsection analyzes the robustness of the proposed algorithm. The benchmark networks adopted here originated from Ref. _{add, del}

In this experiment, only the MCL and Coach algorithms are selected for comparison. The reason is that MCL is reported to be the most robust algorithm

The Figure

**Robustness analysis**. In the left panel, each curve denotes the value of accuracy, while that in the right represents the value of separation: (A-B) edge addition to the test graph; (C-D) edge removal from the test graph; (E-F) edge addition to the altered graph with 40% of edges removed randomly; (G-H) edge removal from the altered graph with 40% of edges added randomly.

Figure

Figure

Conclusions

Protein complexes are key molecular units in cellular functions, and computational approaches that accurately discover the unknown protein complexes hidden in the available PPI data are critically needed. At present, these computational algorithms focus on the roles of proteins without taking into account the roles of interactions.

In this paper, we investigate the possibility of predicting protein complexes by using the roles of edges in PPI networks. Firstly, the weak ties phenomenon in the PPI network is demonstrated by using the concept of bridgeness. Secondly, a reliable virtual PPI network is constructed by making use of the relation between topological similarity and bridgeness. Finally, a core-attachment algorithm is designed. The experimental results demonstrate that exploiting the roles of edges in biological networks is more promising than exploiting the roles of proteins alone, which highlights the importance of the roles of interactions.

The possible future research directions are

Thus, designing effective and efficient methods which can solve these problems will be very important and interesting.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

XM designed the study. XM and LG implemented the method, performed the experiments, analyzed the data and wrote the manuscript.

Acknowledgements

This work was supported by the National Key NSFC (Grant Nos. 60933009 and 91130006), NSFC (Grant Nos. 61072103, 61100157 and 61174162), SRFDPHE (Grant No. 200807010013) and FRFCU (Grant No. K50510030006).

This article has been published as part of