Intelligent Systems, College of Information Technology, UAEU, Al Ain, UAE

Faculty of Mechanics and Mathematics, Moscow State Uni., Moscow, Russia

Abstract

Background

Predicting protein complexes from protein-protein interaction data is becoming a fundamental problem in computational biology. Identifying and characterizing the protein complexes involved is crucial to understanding molecular events under normal and abnormal physiological conditions. Large datasets of protein-protein interactions have been produced by high-throughput experimental techniques; however, such data usually contain a large number of spurious interactions. It is therefore essential to validate these interactions before using them to predict protein complexes.

Results

In this paper, we propose a novel graph mining algorithm (PEWCC) to identify such protein complexes. Firstly, the algorithm assesses the reliability of the interaction data, then predicts protein complexes based on the concept of weighted clustering coefficient. To demonstrate the effectiveness of the proposed method, the performance of PEWCC was compared to several methods. PEWCC was able to detect more matched complexes than any of the state-of-the-art methods with higher quality scores.

Conclusions

The higher accuracy achieved by PEWCC in detecting protein complexes is a valid argument in favor of the proposed method. The datasets and programs are freely available at

Background

Protein complexes are groups of associated polypeptide chains whose malfunctions play a vital role in disease development
^{
in
}(^{
bound
}(

where $N_u$ and $N_v$ are the numbers of neighbors of proteins $u$ and $v$

where

Equations (1) and (2) show how many 3-cliques can be generated from the interactions between proteins

where ^{0}(^{0}(^{
th
} step; ^{1}(^{
k
}(


**Reliability weight of the edge $e_1$ using AdjstCD depends on the outgoing edges $e_6$, $e_7$, …, $e_{10}$.** However, in a noisy network many of these outgoing edges may not be reliable. Therefore, the reliability of the edge

In this paper, we propose a simple yet effective method for protein complex identification. In addition to improving graph mining techniques, it is necessary to obtain high-quality benchmarks by assessing the reliability of protein interactions. We therefore propose a novel method that first assesses the reliability of the interaction data and then detects protein complexes. Unlike CMC, this method finds near-maximum cliques (maximal cliques without unreliable interactions). We employ the weighted clustering coefficient as the measure of which subgraph is closest to the maximal clique. The clustering coefficient of a vertex in this case is the density of its neighborhood

Methods

Computational approaches for detecting protein complexes from PPI data are a useful complement to experimental methods, given the limitations of techniques such as Tandem Affinity Purification (TAP)

Assessing the reliability of protein interactions

In this section we introduce the PE-measure, a new measure of the reliability of interactions between protein pairs. The PE-measure enables us to reduce the level of noise associated with PPI networks and is defined as follows:

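Given a PPI network $G = (V, E)$, the elements $(P_0)_{ij}$ of the initial ($k = 0$) matrix $P_0$ are equal to 0.5 (given that $(v_i, v_j) \in E$). The element $(P_k)_{ij}$ of the matrix $P_k$ in iteration $k$ is computed as

$$(P_k)_{ij} = 1 - \prod_{v_l} \left(1 - (P_{k-1})_{il} \cdot (P_{k-1})_{jl}\right) \tag{4}$$

where we take the product over all $v_l$ such that $(v_l, v_i) \in E$ and $(v_l, v_j) \in E$.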

To illustrate the weighting scheme, consider a hypothetical network as shown in Figure

A simple hypothetical network of 5 proteins and 6 interactions to illustrate how the weight of the edge $e_1$ is determined


Suppose we would like to determine the weight of the edge $e_1$ (between protein 1 and protein 2). According to Equation (4), the probabilities that protein 3 and protein 4 do not “support” the edge $e_1$ are $(1 - p_{1,3} \cdot p_{2,3})$ and $(1 - p_{1,4} \cdot p_{2,4})$, respectively. Thus, the probability that neither protein 3 nor protein 4 “supports” the edge $e_1$ is $(1 - p_{1,3} \cdot p_{2,3}) \cdot (1 - p_{1,4} \cdot p_{2,4})$. Therefore, the probability that protein 1 and protein 2 interact (supported by protein 3 and protein 4) is the complementary probability $1 - [(1 - p_{1,3} \cdot p_{2,3}) \cdot (1 - p_{1,4} \cdot p_{2,4})]$.

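We start with the initial probability matrix $P_0$ (where $p_{1,3}$, $p_{2,3}$, $p_{2,4}$, $p_{1,4}$ and $p_{3,5}$ are all equal to 0.5). In the first iteration ($k = 1$), the weight of $e_1$ is $1 - (1 - 0.5 \cdot 0.5)(1 - 0.5 \cdot 0.5) = 0.4375$, the weights of $e_2$, $e_3$, $e_4$ and $e_5$ are all equal to $1 - (1 - 0.5 \cdot 0.5) = 0.25$, and the weight of $e_6$ is equal to 0. All of the PE-measures are updated before the second iteration ($k = 2$).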
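The update described in the worked example can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `pe_measure`, the `initial` and `iterations` parameters, and the edge list (inferred from the index pairs quoted in the example, plus the edge between proteins 1 and 2) are assumptions made here.

```python
def pe_measure(edges, iterations=1, initial=0.5):
    """Re-weight each edge by the support it gets from common neighbours.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    # Neighbour sets of the undirected interaction graph.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Every observed interaction starts with the same probability.
    weights = {frozenset(edge): initial for edge in edges}

    for _ in range(iterations):
        updated = {}
        for edge in weights:
            u, v = tuple(edge)
            # Probability that no common neighbour "supports" the edge.
            unsupported = 1.0
            for l in neighbors[u] & neighbors[v]:
                unsupported *= 1.0 - weights[frozenset((u, l))] * weights[frozenset((v, l))]
            updated[edge] = 1.0 - unsupported
        # All edges are replaced together, so one iteration only reads
        # values from the previous iteration.
        weights = updated
    return weights


# Assumed edge list for the 5-protein, 6-interaction example.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (1, 4), (3, 5)]
first = pe_measure(edges, iterations=1)
for edge in edges:
    print(edge, first[frozenset(edge)])
```

An edge with no common neighbour, such as (3, 5) here, has an empty product and therefore receives weight 0 after the first iteration.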

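For each protein in the PPI network, we calculate the average PE-measure $(P_{avg})_i$ of all outgoing edges as follows:

$$(P_{avg})_i = \frac{1}{N_i} \sum_{v_l} (P_k)_{il} \tag{5}$$

where the sum is taken over all $v_l : (v_l, v_i) \in E$ and $N_i$ is the number of the neighbors of $v_i$. If $(P_k)_{il}$ is less than the average $(P_{avg})_i$, then the edge between proteins $v_i$ and $v_l$ is considered unreliable and is removed from the network.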

Applying Equation (4) to the hypothetical network shown in the figure, the edge $e_6$ yields a lower weight, equal to 0; it is therefore likely to be noise and should be removed from the network.
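The averaging step can be illustrated as follows. This is a sketch under stated assumptions: the weights are the first-iteration values obtained by applying Equation (4) to the assumed 6-edge example network, the helper names are hypothetical, and the comparison is made from the point of view of a single protein because the surviving text does not say how the two endpoints of an edge are reconciled.

```python
def average_pe(weights, protein):
    """Mean PE-measure over the edges incident to `protein`."""
    incident = [w for edge, w in weights.items() if protein in edge]
    return sum(incident) / len(incident)


def below_average_edges(weights, protein):
    """Edges at `protein` whose weight is strictly below that protein's average."""
    avg = average_pe(weights, protein)
    return sorted(
        tuple(sorted(edge))
        for edge, w in weights.items()
        if protein in edge and w < avg
    )


# First-iteration weights for the assumed example network.
weights = {
    frozenset((1, 2)): 0.4375,
    frozenset((1, 3)): 0.25,
    frozenset((2, 3)): 0.25,
    frozenset((2, 4)): 0.25,
    frozenset((1, 4)): 0.25,
    frozenset((3, 5)): 0.0,
}

avg3 = average_pe(weights, 3)            # (0.25 + 0.25 + 0.0) / 3
flagged = below_average_edges(weights, 3)
print(avg3, flagged)
```

Seen from protein 3, only the edge to protein 5 falls below the average, which agrees with the text's observation that this edge is a candidate for removal.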

Detecting protein complex using weighted clustering coefficient

For each protein $v_i$ in the PPI network, we first create the neighborhood graph, calculate the weighted clustering coefficient and then calculate the degree of each node in the neighborhood graph; the “degree” of a node being the number of its neighbors. The weighted clustering coefficient $WCC_i$ in this case is calculated according to the following formula:

where $z_{3cliques}$ is the number of 3-cliques in the neighborhood graph. Once the degree is calculated, we sort the proteins in the neighborhood graph from minimum to maximum degree. The protein $v_j$ with the lowest degree and its corresponding interactions are removed from the neighborhood graph and $WCC_i$ is recalculated. This process stops when the neighborhood graph contains only 3 proteins, and the sequence of proteins with the highest $WCC_i$ is returned as a valid core protein complex. This concept is illustrated in Figure

Illustration of how a protein complex is detected: (a) a simple hypothetical network of 6 proteins and 12 interactions; (b) based on the degree sequence, node 5 has only 2 outgoing connections and is therefore removed from the protein network; (c) based on the degree sequence, node 3 is removed, so the subgraph containing the central protein 1 and three nodes (2, 4 and 6) remains as a valid core protein complex; (d) a protein that interacts with more than 50%, such as protein 3, rejoins the protein network and the final complex is predicted


In panel (a) of the figure, the neighborhood graph of protein 1 contains 5 proteins (the central protein 1 is not considered) and $z_{3cliques} = 7$, from which $WCC_1$ follows according to Equation 6. After node 5 is removed, $z_{3cliques} = 5$ and therefore $WCC_1 = 0.21$. Based on the degree sequence there is now a tie, so either node 3 or node 4 is removed at random. If node 3 is removed, as shown in panel (c), $WCC_1$ is equal to 0.33 and the subgraph containing the central protein 1 and three nodes (2, 4 and 6) is a valid core protein complex. Once the core protein complex is identified, we examine the main subgraph once again and re-join any protein which interacts with more than a given percentage (the re-join parameter; 50% in panel (d), where protein 3 rejoins).
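The peeling procedure can be sketched as below. Several things here are assumptions rather than facts from the text: the edge list is one hypothetical 6-protein, 12-interaction graph that is consistent with the figure caption; the score in `wcc` (3-cliques through the central protein, divided by $N \cdot N(N-1)/2$ for $N$ neighbors) was chosen only because it reproduces the quoted values 0.21 and 0.33, and should not be read as the authors' Equation 6; the re-join rule measures the share of core proteins a candidate interacts with; and ties are broken by the smallest label instead of at random so that the run is repeatable.

```python
def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def wcc(nodes, adj):
    """ASSUMED score: z / (N * N(N-1)/2), fitted to the quoted example values."""
    n = len(nodes)
    if n < 2:
        return 0.0
    # Each edge between two neighbours closes one 3-clique with the centre.
    z = sum(len(adj[u] & nodes) for u in nodes) // 2
    return z / (n * (n * (n - 1) / 2))


def core_complex(center, adj, min_size=3):
    """Peel lowest-degree neighbours and keep the best-scoring neighbour set."""
    current = set(adj[center])
    trace = [(sorted(current), wcc(current, adj))]
    while len(current) > min_size:
        # Degree inside the neighbourhood graph (the centre is not counted).
        degree = {u: len(adj[u] & current) for u in current}
        # Deterministic tie-break (smallest label); the text removes at random.
        victim = min(current, key=lambda u: (degree[u], u))
        current = current - {victim}
        trace.append((sorted(current), wcc(current, adj)))
    best_nodes, _ = max(trace, key=lambda item: item[1])
    return {center} | set(best_nodes), trace


def rejoin(core, center, adj, threshold=0.5):
    """Re-add neighbours of the centre linked to more than `threshold` of the core."""
    extra = {
        u for u in adj[center] - core
        if len(adj[u] & core) / len(core) > threshold
    }
    return core | extra


# Hypothetical graph: protein 1 is central and linked to all other proteins.
edges = [
    (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
    (2, 3), (2, 4), (2, 6), (3, 6), (4, 6), (3, 5), (4, 5),
]
adj = build_adjacency(edges)
core, trace = core_complex(1, adj)
final = rejoin(core, 1, adj)
print([(nodes, round(score, 2)) for nodes, score in trace])
print(sorted(core), sorted(final))
```

On this graph the sketch removes node 5 and then node 3, keeps proteins 1, 2, 4 and 6 as the core, and lets protein 3 rejoin, which mirrors panels (b) to (d) of the figure.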

Assessing the quality of predicted complexes

To evaluate the accuracy of the proposed method, we used the Jaccard index, which is defined as follows:

$$Jaccard(K, R) = \frac{|V_K \cap V_R|}{|V_K \cup V_R|}$$

where $V_K$ and $V_R$ are the sets of proteins in complexes $K$ and $R$, respectively.

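To estimate the cumulative quality of the prediction, assume a set of reference complexes $\{K_1, K_2, \ldots, K_n\}$ and a set of predicted complexes $\{R_1, R_2, \ldots, R_m\}$. The recall ($Rec$) is the fraction of reference complexes matched by at least one predicted complex, and the precision ($Prec$) is the fraction of predicted complexes that match at least one reference complex:

$$Rec = \frac{|\{K_i : \exists R_j \text{ matching } K_i\}|}{n}$$

and

$$Prec = \frac{|\{R_j : \exists K_i \text{ matching } R_j\}|}{m}$$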
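The matching-based scores can be sketched as follows. The match threshold of 0.5 on the Jaccard index is an assumption made for this illustration (the surviving text does not state the value used), the function names are hypothetical, and the toy complexes are invented. The F1 score is taken as the harmonic mean of precision and recall, which agrees with the F1 columns reported in the results tables.

```python
def jaccard(a, b):
    """Jaccard index of two protein sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def f1_score(prec, rec):
    """Harmonic mean of precision and recall."""
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0


def precision_recall_f1(predicted, reference, threshold=0.5):
    """Fractions of matched predicted and reference complexes, plus F1."""
    matched_pred = sum(
        any(jaccard(p, r) >= threshold for r in reference) for p in predicted
    )
    matched_ref = sum(
        any(jaccard(p, r) >= threshold for p in predicted) for r in reference
    )
    prec = matched_pred / len(predicted) if predicted else 0.0
    rec = matched_ref / len(reference) if reference else 0.0
    return prec, rec, f1_score(prec, rec)


# Invented toy data, for illustration only.
predicted = [{1, 2, 3, 4}, {5, 6}, {7, 8, 9}]
reference = [{1, 2, 3}, {7, 8, 9, 10, 11}]
prec, rec, f1 = precision_recall_f1(predicted, reference)
print(prec, rec, f1)

# Consistency check against one reported row (Prec 0.435, Rec 0.469).
print(round(f1_score(0.435, 0.469), 3))
```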

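Following Brohee and van Helden, we also use the contingency table $T = [t_{ij}]$ of the complexes. Given $n$ reference complexes and $m$ predicted complexes, let $t_{ij}$ denote the number of proteins that are found both in reference complex $i$ and predicted complex $j$, and let $n_i$ denote the number of proteins in reference complex $i$. The sensitivity ($Sn$) and the positive predictive value ($PPV$) are then defined as

$$Sn = \frac{\sum_{i=1}^{n} \max_{j} t_{ij}}{\sum_{i=1}^{n} n_i}$$

and

$$PPV = \frac{\sum_{j=1}^{m} \max_{i} t_{ij}}{\sum_{j=1}^{m} \sum_{i=1}^{n} t_{ij}}$$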
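A small sketch of the contingency-table scores is given below. It uses the clustering-wise sensitivity and positive predictive value as defined by Brohee and van Helden, and takes the accuracy as their geometric mean, which is that paper's definition; the function name and the toy complexes are assumptions made for illustration.

```python
from math import sqrt


def sn_ppv_acc(reference, predicted):
    """Clustering-wise Sn, PPV and their geometric mean Acc."""
    # t[i][j]: proteins shared by reference complex i and predicted complex j.
    t = [[len(set(r) & set(p)) for p in predicted] for r in reference]

    total_ref = sum(len(set(r)) for r in reference)
    sn = sum(max(row) for row in t) / total_ref if total_ref else 0.0

    col_max = [
        max(t[i][j] for i in range(len(reference)))
        for j in range(len(predicted))
    ]
    total_shared = sum(sum(row) for row in t)
    ppv = sum(col_max) / total_shared if total_shared else 0.0

    return sn, ppv, sqrt(sn * ppv)


# Invented toy data, for illustration only.
reference = [{1, 2, 3}, {7, 8, 9, 10, 11}]
predicted = [{1, 2, 3, 4}, {5, 6}, {7, 8, 9}]
sn, ppv, acc = sn_ppv_acc(reference, predicted)
print(sn, ppv, acc)
```

Because PPV is normalised by the shared proteins only, a predicted complex that overlaps no reference complex (the second one here) does not lower it; this is one reason these scores are usually read together with match-based precision and recall.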

Since

Following Nepusz et al.

**The algorithm to calculate the MMR.**


The experiments were conducted on a PC with an Intel(R) Core(TM)2 CPU 6400 @ 2.13 GHz and 3 GB of RAM.

Results and discussion

In this section, we first describe the datasets and evaluate the current methods for protein complex detection, and then study the performance of PEWCC and the impact of the PE-measure. The effectiveness of our method is evaluated using two different PPI datasets. The first is a combined PPI dataset (PPI-D1) developed by Liu et al.

| Dataset | Proteins | Interactions | Network density | Clustering coefficient | Av. no. of neighbors | Isolated proteins |
|---|---|---|---|---|---|---|
| PPI-D1 | 3,869 | 19,165 | 0.002 | 0.157 | 8.957 | 8 |
| PPI-D2 | 5,640 | 59,748 | 0.004 | 0.246 | 21.187 | 0 |

**The summary of the parameters setup.**


Three reference sets of protein complexes are used in these experiments. The first set of complexes (Cmplx-D1) comprises 162 hand-curated complexes from MIPS

In the first experiment, we attempted to find the optimal value of the re-join parameter


**Measuring the effect of varying the values of the re-join parameter (****) in terms of ****.** For

In Table

| Method | Matched Cmplx (Cmplx-D1) | Prec (Cmplx-D1) | Rec (Cmplx-D1) | F1 (Cmplx-D1) | Matched Cmplx (Cmplx-D2) | Prec (Cmplx-D2) | Rec (Cmplx-D2) | F1 (Cmplx-D2) |
|---|---|---|---|---|---|---|---|---|
| PEWCC | 58 | 0.435 | 0.469 | 0.451 | 61 | 0.468 | 0.910 | 0.618 |
| CMC | 56 | 0.297 | 0.346 | 0.320 | 57 | 0.385 | 0.889 | 0.537 |
| ClusterONE | 52 | 0.204 | 0.387 | 0.267 | 48 | 0.231 | 0.872 | 0.365 |
| MCL | 51 | 0.353 | 0.315 | 0.333 | 52 | 0.448 | 0.825 | 0.581 |
| MCODE | 39 | 0.330 | 0.241 | 0.279 | 34 | 0.386 | 0.540 | 0.450 |
| CFilter | 46 | 0.379 | 0.284 | 0.325 | 43 | 0.463 | 0.683 | 0.552 |

As shown in Table

To analyze the performance of PEWCC, ClusterONE and CMC in a noisy interaction dataset, we added different random sets of interaction pairs to Cmplx-D1 (1000 PPI pairs at a time). In Figure


**Comparing PEWCC, ClusterONE and CMC in the presence of additional sets of random PPI pairs in terms of the number of matched complexes detected, F1, PPV and MMR.**

Furthermore, the impacts of the PE-measure and the AdjstCD measure on improving the detection of matched complexes were assessed using the datasets PPI-D1 and Cmplx-D1. In Table

| Method | Clusters predicted | Matched Cmplx | Perc. of successful Cmplx | Rec | Prec | PPV | F |
|---|---|---|---|---|---|---|---|
| CMC | 133 | 45 | 28 | 0.217 | 0.263 | 0.172 | 0.238 |
| ClusterONE | 498 | 77 | 47.5 | 0.372 | 0.118 | 0.301 | 0.180 |
| AdjstCD+CMC | 127 | 75 | 46.3 | 0.362 | 0.455 | 0.277 | 0.404 |
| AdjstCD+ClusterONE | 139 | 78 | 48.2 | 0.377 | 0.393 | 0.294 | 0.385 |
| PE+CMC | 112 | 77 | 47.5 | 0.372 | 0.446 | 0.313 | 0.406 |
| PE+ClusterONE | 110 | 81 | 50 | 0.391 | 0.464 | 0.318 | 0.424 |
| PE+WCC | 128 | 89 | 54.9 | 0.435 | 0.469 | 0.262 | 0.451 |

For generalization purposes PEWCC was further compared to several state-of-the-art methods based on the protein interaction dataset PPI-D2 and the reference dataset Cmplx-D3. PPI-D2 and Cmplx-D3 were recently published and used to evaluate the performance of ClusterONE

As shown in Table

| Method | Clusters predicted | Matched Cmplx | Perc. of successful Cmplx | Sn | PPV | Acc | MMR |
|---|---|---|---|---|---|---|---|
| PEWCC | 468 | 122 | 60.1 | 0.551 | 0.430 | 0.491 | 0.348 |
| ClusterONE | 473 | 88 | 43.3 | 0.454 | 0.427 | 0.440 | 0.195 |
| RNSC | 209 | 79 | 38.9 | 0.399 | 0.441 | 0.419 | 0.192 |
| RRW | 253 | 75 | 36.9 | 0.276 | 0.429 | 0.344 | 0.178 |
| CMC | 73 | 53 | 26.1 | 0.323 | 0.404 | 0.487 | 0.176 |
| MCL | 338 | 37 | 18.2 | 0.346 | 0.350 | 0.348 | 0.083 |
| MCODE | 85 | 21 | 10.3 | 0.285 | 0.284 | 0.285 | 0.048 |

Conclusion

In this paper, we have presented a novel method (PEWCC) for detecting protein complexes from a PPI network of yeast. We have shown that our approach, which first assesses the quality of the interaction data and then detects protein complexes based on the concept of weighted clustering coefficient, is more accurate than most of the well-known methods.

The noise associated with PPI networks and the focus on dense subgraphs have prevented researchers from creating an effective algorithm capable of identifying small complexes, and PEWCC is no exception. In fact, we are not aware of any method that can effectively detect small complexes (≤ 3 proteins) using only the topology of the PPI network. PEWCC stops when the neighborhood graph contains only 3 proteins, which prevents it from identifying such small complexes. We found that the clustering coefficient equals 1 for dense graphs of size 3 (with 3 nodes and 3 edges) and 0 for other subgraphs of size 3 (with 3 nodes and 2 edges). We are currently conducting a systematic study of nested complexes (the case where one complex is a sub-complex of a bigger one) in order to identify strategies that could improve the capability of PEWCC in identifying small complexes.

The performance of PEWCC could also be tested after randomly removing edges from the original graph. However, we strongly believe that the main issue concerning PPI data is the noise associated with false interactions (edges). Many interactions are not reliable, and removing them using the PE-measure and AdjstCD improved the prediction accuracy. Moreover, if edges are removed uniformly over the PPI network, the PEWCC algorithm will still work, because it calculates relative density (of one subgraph with respect to another). That is, if we have two subgraphs $G_1$ and $G_2$ and the density of $G_1$ is less than the density of $G_2$, then after the random deletion of some edges from $G_1$ and $G_2$, the probability that the density of $G_1$ is still less than the density of $G_2$ remains very high.
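The density-ordering argument can be probed with a small Monte Carlo sketch. It assumes that each edge is kept independently with probability `keep_prob`, which is one simple reading of uniform removal; the function names, subgraph sizes and trial count are hypothetical, and the estimate says nothing about real PPI data.

```python
import random


def density(n_nodes, n_edges):
    """Edge density of a simple undirected graph with at least 2 nodes."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


def order_preserved_rate(n1, m1, n2, m2, keep_prob, trials=2000, seed=0):
    """Estimate how often density(G1) < density(G2) survives random deletion.

    G1 has n1 nodes and m1 edges, G2 has n2 nodes and m2 edges; only the
    edge counts matter for density, so no explicit graphs are built.
    """
    rng = random.Random(seed)
    preserved = 0
    for _ in range(trials):
        # Number of edges that survive in each subgraph.
        kept1 = sum(rng.random() < keep_prob for _ in range(m1))
        kept2 = sum(rng.random() < keep_prob for _ in range(m2))
        if density(n1, kept1) < density(n2, kept2):
            preserved += 1
    return preserved / trials


# Hypothetical subgraphs: same size, G2 twice as dense as G1.
rate = order_preserved_rate(10, 15, 10, 30, keep_prob=0.8)
print(rate)
```

With `keep_prob=1.0` nothing is deleted and the ordering is preserved in every trial; lowering `keep_prob` or choosing subgraphs with closer densities shows how quickly the guarantee weakens.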

In the future, we would like to compare the performance of PE to the recently published novel weighting schemes for noise reduction in PPI networks by Kritikos et al.

Furthermore, the idea of decomposing the PPI network into overlapping clusters will be explored as it shows great potential in recent works

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

NZ and DF designed the method and conceived the study. JB implemented the method. NZ performed the experiments and wrote the paper. All authors read and approved the final manuscript.

Acknowledgements

The authors would like to acknowledge the assistance provided by the Emirates Foundation (EF Grant Ref. No. 2010/116), the National Research Foundation (NRF Grant Ref. No. 21T021) and the Research Support and Sponsored Projects Office and the Faculty of Information Technology at the United Arab Emirates University (UAEU).