School of Information Science and Engineering, Central South University, Changsha 410083, China

Department of Computer Science, Georgia State University, Atlanta, GA 30303-4110, USA

Abstract

Background

Identification of protein complexes in large interaction networks is crucial for understanding the principles of cellular organization and for predicting protein functions, and it is one of the most important issues in the post-genomic era. In real protein-protein interaction networks, each protein may belong to multiple protein complexes. Identifying overlapping protein complexes from protein-protein interaction networks is therefore an important research topic.

Result

As an effective algorithm for identifying overlapping module structures, the clique percolation method (CPM) has been widely applied to social networks and biological networks. However, the recognition accuracy of algorithm CPM is low. Furthermore, algorithm CPM is not well suited to identifying meso-scale protein complexes when it is applied to protein-protein interaction networks. In this paper, we propose a new topological model by extending the definition of

Conclusion

The proposed algorithm CP-DR, based on clique percolation and distance restriction, makes it possible to identify dense subgraphs in protein interaction networks, a large number of which correspond to known protein complexes. Compared with algorithm CPM, algorithm CP-DR achieves better performance.

Background

With the successful completion of the Human Genome Project, biomedical research has entered the post-genome era. In this new era, one of the most important challenges is to systematically analyze and comprehensively understand how proteins accomplish life activities by interacting with each other

The basic idea of non-overlapping clustering algorithms is that each protein belongs to one and only one protein complex in a large-scale protein-protein interaction network. King

In recent years, a variety of algorithms that extend the G-N algorithm have been employed to analyze the overlapping structures of large-scale complex networks, including protein-protein interaction networks. Representative algorithms include the Cluster-Overlap Newman Girvan Algorithm (CONGA) ^{3})

Building on research into the overlapping structure of large-scale complex networks, a powerful algorithm based on clique percolation for finding protein complexes and exploring the general characteristics of complex networks in biology has recently been developed by Palla

The Proposed Algorithm

Algorithm CPM

Palla

**A simple illustration of the extraction of the k-clique-communities at k = 4 using the clique-clique overlap matrix**
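The extraction of k-clique-communities through the clique-clique overlap matrix can be sketched in a few lines of Python. The sketch below follows the standard clique percolation procedure: enumerate the maximal cliques, keep those with at least k vertices (the diagonal entries of the overlap matrix), and merge any two cliques that share at least k - 1 vertices (the off-diagonal entries). The function names and the toy graph are illustrative assumptions; they are not taken from CFinder or from the original implementation.

```python
from itertools import combinations


def graph_from_edges(edges):
    """Build an undirected adjacency map {vertex: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques (Bron-Kerbosch with pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def k_clique_communities(adj, k):
    """Merge maximal cliques of size >= k that share >= k - 1 vertices."""
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]
    parent = list(range(len(cliques)))  # union-find over clique indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(groups.values(), key=lambda s: sorted(s))


# Toy graph (hypothetical): two 4-cliques sharing three vertices,
# plus a third 4-clique that touches them only at vertex 5.
edges = (list(combinations([1, 2, 3, 4], 2))
         + list(combinations([2, 3, 4, 5], 2))
         + list(combinations([5, 6, 7, 8], 2)))
adj = graph_from_edges(edges)
communities = k_clique_communities(adj, 4)
print(communities)
```

In this toy graph, vertex 5 belongs to both communities at k = 4, which is the kind of overlap that non-overlapping clustering cannot represent.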

The result of algorithm CPM depends closely on the value of the clique percolation parameter

Algorithm CP-DR

In recent years, studies have found that most important biological processes, such as signal transduction, cell-fate regulation, transcription and translation, involve more than four but far fewer than hundreds of proteins. Most relevant processes in biological networks correspond to the meso-scale (5-25 genes or proteins)

**The protein complex identified by algorithm CPM and the real protein complexes.** The left panel Fig.

In algorithm CPM, each

Our new topological structure of identified clusters is based on the observation that a typical member of a cluster is linked to many other members, but not necessarily to all other vertices in the cluster. In other words, an identified cluster can be interpreted as a union of small complete (fully connected) subgraphs that share vertices. We define an identified cluster as the union of all maximal cliques that satisfy the distance restriction and that can be reached from each other through a series of adjacent maximal cliques (where two maximal cliques are said to be adjacent if they share

In the following discussion, we denote by _{c(U, V)}_{l(U, V)}_{c(U, V)}_{l(U, V)},

Condition 1

In the definition of our new topological model, an identified cluster can be seen as the union of all maximal cliques that can be reached from each other through a series of adjacent maximal cliques (where two maximal cliques are said to be adjacent if they share

_{c(U, V)} ≥

where

Condition 2

In our new topological model, an identified cluster should also satisfy the distance restriction. As mentioned above, the distance is represented by the diameter of the identified cluster. According to the small-world property of protein interaction networks

_{l(U, V)} ≤

where
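A minimal sketch of how the distance restriction of Condition 2 could be checked is given below. It assumes that the diameter is measured inside the subgraph induced by the candidate cluster, and it takes the bound as a hypothetical parameter `max_diameter`. Both choices are assumptions made for illustration, and the function names are not taken from the CP-DR implementation.

```python
from collections import deque


def induced_diameter(adj, nodes):
    """Longest shortest-path length inside the subgraph induced by `nodes`.

    Returns float('inf') if the induced subgraph is disconnected.
    `adj` maps each vertex to the set of its neighbours.
    """
    nodes = set(nodes)
    diameter = 0
    for source in nodes:
        # Breadth-first search restricted to the candidate cluster.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u] & nodes:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < len(nodes):
            return float("inf")
        diameter = max(diameter, max(dist.values()))
    return diameter


def satisfies_distance_restriction(adj, nodes, max_diameter):
    """True if the candidate cluster's diameter does not exceed the bound."""
    return induced_diameter(adj, nodes) <= max_diameter


# Hypothetical example: a simple path 1 - 2 - 3 - 4.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(induced_diameter(path, [1, 2, 3, 4]))
print(satisfies_distance_restriction(path, [1, 2, 3], 2))
```

One breadth-first search per vertex is sufficient here because candidate clusters at the meso-scale contain only a few dozen proteins.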

As noted in the previous subsection, two

where

In literature

Based on the characteristics of our new topological model described above, we propose a novel algorithm called CP-DR (Clique Percolation Method based on Distance Restriction) for identifying protein complexes based on clique percolation and distance restriction. The description of algorithm CP-DR is shown in Fig.

**The description of algorithm CP-DR.** The algorithm CP-DR (Clique Percolation Method based on Distance Restriction) based on a new topological model by extending the definition of

As shown in Fig.

In step 1 of algorithm CP-DR, the time complexity of transforming the protein-protein interaction information into an undirected simple graph is ^{2}s^{3}^{2}s^{3}

Results and Discussions

To evaluate the suitability and validity of our proposed algorithm for identifying overlapping protein complexes in protein-protein interaction networks, we implemented algorithm CP-DR in C++ and downloaded the overlapping protein complex identification tool CFinder from

In the following subsections, we will compare the predicted clusters with the known complexes, analyze the

Comparison with the known complexes

To evaluate the effectiveness of algorithm CP-DR in detecting protein complexes, we compare the predicted overlapping structures produced by this algorithm with the known protein complexes in the MIPS yeast complex database. The gold standard consists of 216 manually annotated complexes, each containing two or more proteins. The largest complex contains 81 proteins, the smallest contains 2 proteins, and the average complex size is 6.31. Here, we use the same scoring scheme used in

where _{Pc}|_{Kc}

A known complex

The numbers of known complexes matched at different overlapping score thresholds (from 0 to 1 in increments of 0.1) for the result data sets generated by algorithm CPM using different parameter values and by algorithm CP-DR are shown in Figure

**Comparison of the predicted clusters with the known complexes.** The number of known complexes matched at different overlapping score thresholds (from 0 to 1 in increments of 0.1) for the result data sets generated by algorithm CPM using different parameter values and by algorithm CP-DR.
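The counting behind this comparison can be sketched as follows. The scoring formula itself is not reproduced in this text, so the sketch assumes the overlap score commonly used for this purpose, OS(Pc, Kc) = i² / (|Pc| × |Kc|), where i is the number of proteins shared by the predicted cluster Pc and the known complex Kc. It also assumes that a known complex counts as matched when at least one predicted cluster reaches the threshold. The function names are hypothetical.

```python
def overlap_score(predicted, known):
    """Assumed overlap score: i^2 / (|Pc| * |Kc|), i = shared proteins."""
    predicted, known = set(predicted), set(known)
    if not predicted or not known:
        return 0.0
    shared = len(predicted & known)
    return shared ** 2 / (len(predicted) * len(known))


def matched_known_complexes(predicted_list, known_list, threshold):
    """Count known complexes matched by at least one predicted cluster."""
    return sum(
        1
        for kc in known_list
        if any(overlap_score(pc, kc) >= threshold for pc in predicted_list)
    )


# Hypothetical example: a 13-protein cluster containing all 12 proteins
# of a known complex scores 144 / 156 under the assumed formula.
cluster = [f"P{i}" for i in range(13)]
complex_ = [f"P{i}" for i in range(12)]
print(round(overlap_score(cluster, complex_), 3))

# Sweep the threshold from 0 to 1 in steps of 0.1, as in the comparison.
for step in range(11):
    t = step / 10
    print(t, matched_known_complexes([cluster], [complex_], t))
```

Under this assumption the score is 1 only when the predicted cluster and the known complex are identical, and it penalizes oversized clusters even when they contain the whole known complex.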

In our experiment, we found that almost all protein complexes identified by algorithm CPM could also be accurately detected by algorithm CP-DR when the protein complexes met the distance restriction condition. In addition, the introduction of the distance restriction reasonably limits the size of protein complexes, so that algorithm CP-DR can identify a large number of protein complexes with specific biological significance and biological functions more effectively, precisely and comprehensively. Table

Examples of protein complexes identified by algorithm CPM and algorithm CP-DR.

CPM

CP-DR

Sequence

Known Complex

Size

Size

YDR226c YER165w YKR002w

Complex1

YMR061w YOL123w YGL044c

6

0.833

YKR002w YMR061w YLR115w

13

0.089

Complex2

YAL013c YLR277c YNL317w

9

0.854

YJR093c YPR107c YDR301w

YPR041w YMR036c YBR079c

Complex3

YNL244c

8

0.443

6

0.795

YOR361w YMR146c YPL105c

YDR429c

YFL088c YKR068c YLR268w YIL004c

Complex4

YML077w YDR407c YOR115c

18

0.150

13

0.923

YMR218c

YBR254c YDR472w YGR166w

YDR246w

In Table

In our algorithm CP-DR, the size of a protein complex is mainly restricted by the distance constraint. It is precisely because of the introduction of the distance restriction that the identified protein complexes match known protein complexes more closely and have more prominent biological significance. A huge protein complex identified by algorithm CPM with clique percolation parameter ^{-3}. When we apply algorithm CP-DR to the same protein-protein interaction network, we identify 1710 protein complexes whose protein vertices and interactions are all included in the largest protein complex identified by algorithm CPM. In Fig.

**The largest protein complex identified by algorithm CPM with clique percolation parameter k=3.** This huge complex contains 865 proteins and 4508 pairs of interactions, which account for approximately a quarter of the protein vertices and one third of the interactions of the protein-protein interaction network.

**A section of predicted complexes by algorithm CP-DR with OS(Pc, Kc) ≥0.2.** All of proteins and interactions are included in Fig.

**Examples of protein complexes identified by algorithm CP-DR.** Four protein complexes identified by algorithm CP-DR, with sizes of 12, 9, 6 and 5, respectively, best matching known protein complexes

Specificity and Sensitivity

where

where

Another integrated method, called the

The

Comparison of algorithm CP-DR and algorithm CPM in Sensitivity, Specificity and f-measure.

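| Algorithm | Parameter | Sensitivity | Specificity | f-measure |
|---|---|---|---|---|
| CP-DR | | 0.872787611 | 0.391952310 | 0.540966747 |
| | | 0.213592233 | 0.247191011 | 0.229166667 |
| CPM | | 0.155339806 | 0.524590164 | 0.239700375 |
| | | 0.092592593 | 0.722222222 | 0.164141415 |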
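The three measures compared in this table can be computed as sketched below. The defining formulas are not reproduced in this text, so the sketch assumes the usual definitions: sensitivity = TP / (TP + FN), specificity = TP / (TP + FP), and the f-measure as the harmonic mean of the two. The tabulated triples are numerically consistent with the harmonic-mean assumption. The function names, and the way TP, FP and FN are counted, are hypothetical.

```python
def sensitivity(tp, fn):
    """Assumed definition: fraction of known complexes that are matched."""
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(tp, fp):
    """Assumed definition: fraction of predicted clusters that match."""
    return tp / (tp + fp) if tp + fp else 0.0


def f_measure(sn, sp):
    """Harmonic mean of sensitivity and specificity."""
    return 2 * sn * sp / (sn + sp) if sn + sp else 0.0


# Check against the first row of the table.
print(round(f_measure(0.872787611, 0.391952310), 6))

# Hypothetical counts, for illustration only.
sn = sensitivity(tp=3, fn=1)
sp = specificity(tp=3, fp=3)
print(sn, sp, f_measure(sn, sp))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a method with very high specificity but low sensitivity still receives a low f-measure.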

As shown in Table

Overlapping Rate Analysis

Definition 1

Overlapping Rate: In undirected graph

According to the definition, we calculate the overlapping rate with the following formula:

where _{v}_{i}

Since each protein might be involved in multiple biological processes in real protein-protein interaction networks, that is, it might belong to several protein complexes, it is necessary to decompose protein-protein interaction networks into overlapping nested structures. Moreover, many studies have shown that this is consistent with the practical situation. In our paper, the protein complex identified by algorithm CPM containing 685 vertices and 4508 pairs of interactions corresponds to 1710 protein complexes detected by our approach. In order to analyze the overlapping rate, we selected 58 of the 1710 protein complexes. According to whether the protein complexes overlap or not, we could construct

**Overlapping complexes identified by algorithm CP-DR.** According to whether the protein complexes overlap or not, we construct

By analyzing the protein complexes detected by algorithm CP-DR and algorithm CPM, we found that the vast majority of proteins belong to only one or two complexes. It is rare for three or more protein complexes to contain the same protein.
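The observation about how many complexes a protein belongs to can be checked with a simple membership count, sketched below. The overlapping rate formula is not reproduced in this text, so the average computed here (total memberships divided by the number of distinct proteins) is only one plausible reading of it and is labeled as an assumption. The function names and the toy complexes are hypothetical.

```python
from collections import Counter


def membership_counts(complexes):
    """For each protein, count how many complexes contain it."""
    counts = Counter()
    for members in complexes:
        counts.update(set(members))
    return counts


def membership_histogram(complexes):
    """Map 'number of complexes' -> 'number of proteins with that count'."""
    return Counter(membership_counts(complexes).values())


def average_membership(complexes):
    """Assumed reading of an average overlapping rate: mean memberships."""
    counts = membership_counts(complexes)
    return sum(counts.values()) / len(counts) if counts else 0.0


# Hypothetical toy complexes, for illustration only.
toy = [{"A", "B", "C"}, {"C", "D"}, {"A", "C", "E"}]
print(membership_histogram(toy))
print(average_membership(toy))
```

A histogram dominated by the entries for one and two complexes corresponds to the situation described above, in which proteins shared by three or more complexes are rare.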

Table

Average overlapping rate of protein complexes identified by algorithm CP-DR and algorithm CPM.

| | CP-DR | CPM( | CPM( | CPM( |
|---|---|---|---|---|
| Overlapping Rate | 2.103 | 1.192 | 1.115 | 1.093 |
| | 33.613% | 13.843% | 10.526% | 9.685% |

Function Enrichment Analysis

In order to detect the functional characteristics of the predicted complexes, we compare the predicted complexes with known functional classification. The

where

A total of 1896 predicted protein complexes match the known functional categories with

According to the

Functional annotation of predicted complexes in Table

| Complexes | ORF | Protein functional categories |
|---|---|---|
| Table | YLR268w | 20.09.07.03; 20.09.07.05; 20.09.07.27 |
| | YKR068c | 20.09.07.03 |
| | YML077w | 20.09.07.03 |
| | YFL038c | 14.10; 20.09.07.03 |
| | YGR166w | 01.05.25; 20.09.07.03 |
| | YDR108w | 10.03.02; 20.09.07.03; 43.01.03.09 |
| | YBR254c | 20.09.07.03 |
| | YDR246w | 20.09.07.03 |
| | YDR407c | 20.09.07.03 |
| | YDR472w | 20.09.07.03 |
| | YMR218c | 20.09.07.03 |
| | YOR115c | 20.09.07.03 |
| | YIL004c | 20.09.07.03; 20.09.07.27 |
| Figure | YBL084c | 10.03.01.01.11; 14.07.05; 14.10; 14.13.01.01; 16.01; 16.19.03 |
| | YDL008w | 10.03.01.01.11; 14.07.05; 14.10; 14.13.01.01; 16.01; 16.19.03 |
| | YDR118w | 10.03.01.01.11; 14.07.05; 14.10; 14.13.01.01; 16.01; 16.19.03 |
| | YFR036w | 10.03.01.01.11; 14.07.05; 14.10; 14.13.01.01; 16.01; 16.19.03 |
| | YGL240w | 10.03.01.01.11; 14.07; 14.13.01.01; 16.01 |
| | YHR166c | 10.03.01.01.11; 14.07.05; 14.10; 14.13.01.01; 16.01; 16.19.03; 42.04 |
| | YKL022c | 10.03.01.01.11; 14.07.05; 14.10; 14.13.01.01; 16.01; 16.19.03 |
| | YLR127c | 10.03.01.01.11; 14.07.05; 14.10; 14.13.01.01; 16.01; 16.19.03 |
| | YNL172w | 10.03.01.01.11; 14.07.05; 14.10; 14.13.01.01; 16.01; 16.19.03 |
| | YOR249c | 10.01.09.05; 10.03.01.01.11; 14.07.05; 14.10; 14.13.01.01; 16.01; 16.19.03 |
| | YLR102c | 10.03.01.01.11; 14.07.05; 14.10; 14.13.01.01; 16.01; 16.19.03 |
| | YIR025w | 10.03.01.01.11; 14.07.05; 14.10; 14.13.01.01; 16.01; 16.19.03 |

The functional category codes represent the following functions: 01.05.25: regulation of C-compound and carbohydrate metabolism; 10.01.09.05: DNA conformation modification; 10.03.01.01.11: mitosis M phase; 10.03.02: meiosis; 14.07.05: modification by ubiquitination, deubiquitination; 14.10: assembly of protein complexes; 14.13.01.01: proteasomal degradation (ubiquitin/proteasomal pathway); 16.01: protein binding; 16.19.03: ATP binding; 20.09.07.03: ER to Golgi transport; 20.09.07.05: intra Golgi transport; 20.09.07.27: vesicle fusion; 42.04: cytoskeleton/structural proteins; 43.01.03.09: development of asco-, basidio- or zygospore.

Conclusions

It is believed that the identification of protein complexes is useful for explaining certain biological processes and for predicting the functions of proteins. In this paper, we extended the definition of

We applied the algorithm CP-DR to the protein interaction network of

Methods

The protein interaction network of

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61003124 and 61073036, the National Basic Research 973 Program of China No.2008CB317107, the Ph.D. Programs Foundation of Ministry of Education of China No. 20090162120073, the U.S. National Science Foundation under Grants CCF-0514750, CCF-0646102, and CNS-0831634, and the Program for Changjiang Scholars and Innovative Research Team in University No. IRT0661. Publication of this supplement was made possible with support from the International Society of Intelligent Biological Medicine (ISIBM).

This article has been published as part of