School of Computer Science and Engineering, Key Lab of Computer Network & Information Integration, MOE, Southeast University, Nanjing, 210018, China

National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210093, China

Division of Mathematics and Computer Science, University of South Carolina Upstate, 800 University Way, Spartanburg, SC 29303, USA

Abstract

Background

Most computational algorithms focus on detecting highly connected subgraphs in PPI networks as protein complexes but ignore their inherent organization, although recent analysis indicates that experimentally detected protein complexes generally contain core-attachment structures. Furthermore, many of these algorithms are computationally expensive.

Methods

In this paper, a Greedy Search Method based on Core-Attachment structure (GSM-CA) is proposed. The GSM-CA method detects densely connected regions in large protein-protein interaction networks based on the edge weight and two criteria for determining core nodes and attachment nodes. The GSM-CA method improves the prediction accuracy compared to other similar module detection approaches; however, it is computationally expensive. Many module detection approaches are based on traditional hierarchical methods, which are also computationally inefficient because the hierarchical tree structure they produce cannot provide adequate information to identify which subtree corresponds to a module. To speed up the computation, the Greedy Search Method based on Fast Clustering (GSM-FC) is proposed in this work. The edge-weight-based GSM-FC method uses a greedy procedure that traverses all edges just once to separate the network into a suitable set of modules.

Results

The proposed methods are applied to the protein interaction network of S. cerevisiae. Experimental results indicate that many significant functional modules are detected, most of which match the known complexes. Results also demonstrate that the GSM-FC algorithm is faster and more accurate than other competing algorithms.

Conclusions

Based on the new edge weight definition, the proposed algorithm takes advantage of the greedy search procedure to separate the network into a suitable set of modules. Experimental analysis shows that the identified modules are statistically significant. The algorithm significantly reduces the computational time while maintaining high prediction accuracy.

Background

With the rapid development of technologies to predict protein interactions, huge data sets portrayed as networks have become available. Most real networks typically contain parts in which the nodes are more highly connected to each other than to the rest of the network. The sets of such nodes are usually called clusters, communities, or modules.

Recently, many research works have been conducted to solve the problem of clustering protein interaction networks

Amin et al.

Adamcsek et al.

The above computational studies mainly focus on detecting highly connected subgraphs in PPI networks as protein complexes but ignore their inherent organization. However, recent analysis indicates that experimentally detected protein complexes generally contain core-attachment structures. Protein complexes often include cores in which proteins are highly co-expressed and share high functional similarity. Core proteins are usually more highly connected to each other, and they may be more essential and have lower evolutionary rates than peripheral proteins.

In this paper, a Greedy Search Method based on Core-Attachment structure, called GSM-CA, is introduced. Compared with other core-attachment methods, GSM-CA proposes a new edge weight calculation and new evaluation criteria for judging whether a node is a core node or an attachment node. The GSM-CA method uses a purely greedy procedure to move a node between two different sets. The detected clusters also have core-attachment structures. In particular, GSM-CA first defines seed edges of the core from the neighbourhood graphs based on the highest weight and then detects protein-complex cores as the hearts of protein complexes. Finally, GSM-CA includes attachments into these cores to form biologically meaningful structures. The new algorithm is applied to the protein interaction network of S. cerevisiae. The modules identified by the new algorithm are mapped to the MIPS

The GSM-CA method achieves high accuracy. However, it is computationally expensive. Many module detection approaches are based on traditional hierarchical methods, which are also computationally inefficient because the hierarchical tree structure generated by the repeated computation cannot provide adequate information to identify which subtree corresponds to a module. To speed up module detection, the Greedy Search Method based on Fast Clustering (GSM-FC) is proposed in this paper. The edge-weight-based GSM-FC method uses a greedy procedure that traverses all edges just once to separate the network into a suitable set of modules. The experimental results demonstrate that the newly proposed algorithm noticeably reduces the computational time while maintaining high prediction accuracy compared to GSM-CA.

Briefly, the outline of this paper is as follows. Section 2 describes the implementation of our two methods in detail. In Section 3, our algorithms are applied to the protein interaction network of the yeast S. cerevisiae and the results are analyzed. Section 4 gives the conclusions.

Methods

Definitions

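Protein interaction networks can be represented as an undirected graph $G = (V, E)$, where $V$ is the set of nodes (proteins) and $E$ is the set of edges (interactions). The set of neighbours of a node $v$ is denoted $N_v$.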

The closeness $cn_{nk}$ of any node $n$ with respect to some node $k$ in cluster $c$ is defined by (1).

Here, _{n }_{k }

The DPClus algorithm defines the weight $w_{uv}$ of an edge $(u, v)$ as the number of common neighbours of the nodes $u$ and $v$.

Therefore, the definition of $w_{uv}$ is modified as (2).

Here _{uv }_{u }_{v}_{uv }_{j}_{k}_{j}_{k}_{j}_{k }_{uv}

The number of common neighbours of two nodes equals the number of paths of length 2 between them. This definition of weight is used to cluster graphs that have densely connected regions separated by sparse regions. In relatively sparse graphs, nodes on paths of length 3 or 4 can also be considered.
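The equivalence between common neighbours and paths of length 2 can be checked on a small example. The following is a minimal Python sketch: the toy graph and the helper names (`build_adjacency`, `common_neighbour_weight`, `paths_of_length_two`) are illustrative assumptions, not part of the authors' implementation.

```python
def build_adjacency(edges):
    """Return a dict mapping each node to the set of its neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def common_neighbour_weight(adj, u, v):
    """Edge weight taken as the number of common neighbours of u and v."""
    return len(adj[u] & adj[v])

def paths_of_length_two(adj, u, v):
    """Count intermediate nodes x with u - x - v, i.e. entry (u, v) of A^2."""
    return sum(1 for x in adj if x in adj[u] and v in adj[x])

# Hypothetical toy graph: a 4-clique {a, b, c, d} with a pendant node e.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
         ("b", "d"), ("c", "d"), ("d", "e")]
adj = build_adjacency(edges)

weights = {(u, v): common_neighbour_weight(adj, u, v) for u, v in edges}
for (u, v), w in weights.items():
    # Both counts agree for every edge of the graph.
    assert w == paths_of_length_two(adj, u, v)
    print(u, v, w)
```

In this toy graph every edge inside the clique receives weight 2, while the edge to the pendant node receives weight 0, which is the separation between dense and sparse regions that the weight is meant to expose.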

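The highest edge weight of a node $n$ is defined as the largest weight among the edges incident to $n$, i.e. $\max_{u \in N_n} w_{nu}$, where $N_n$ is the set of neighbours of $n$.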

Greedy Search Method based on Core Attachment structure (GSM-CA)

Because core and peripheral proteins may have different roles and properties due to their different topological characteristics, a Greedy Search Method based on Core Attachment structure, called GSM-CA, is proposed based on the definition of the edge weight and two evaluation criteria for judging whether a node is a core node or an attachment node. GSM-CA uses a greedy procedure to obtain a suitable set of clusters. It first generates the core of a cluster and then selects reliable attachments cooperating with the core to form the final cluster. The algorithm is divided into six steps: 1) Input & initialization; 2) Termination check; 3) Seed selection; 4) Core formation; 5) Attachments selection; 6) Output & update. The final clusters determine the functional modules. The GSM-CA algorithm is described in full below.

Input & initialization

The input to the algorithm is an undirected simple graph, so the adjacency matrix of the graph is read first. The user must choose the minimum value of closeness used in cluster formation. This minimum value is referred to as $cn_{in}$.

Termination check

Once a cluster is generated, it is removed from the graph. The next cluster is then formed in the remaining graph, and the process continues until no seed edge whose weight is above one (i.e. $w_{uv} > 1$) remains.

Seed selection

Each cluster starts at a deterministic edge called the seed edge. The highest weight edge (n, v) of node n satisfying the condition that _{nv }_{n }

Core formation

A protein complex core is a small group of proteins that show highly co-expressed patterns and share a high degree of functional similarity. It is the key functional unit of the complex and largely determines the cellular role and essentiality of the complex.

The core starts from a single edge and then grows gradually by adding nodes one by one from the neighbours. The neighbours of a core are the nodes connected to any node of the core but not part of the core. The core is referred to as C. For a neighbour u of C, if u's neighbour v linked by u's highest weight edge (u, v) is in C, u is considered to be included into the core. Before including u to C, the condition, _{uv }_{in }

Attachments selection

After the core of one cluster has been detected, the peripheral information of each core is extracted and reliable attachments cooperating with it are selected to form the final cluster. For each neighbour u of the core C, if u's neighbour v linked by u's highest weight edge (u, v) is in C, _{uv }_{uv }

If

Output & update

Once a cluster is generated, graph G is updated by removing the present cluster. The nodes belonging to the present cluster and the incident edges on these nodes are marked as clustered and not considered in the following. Then in the remaining graph, each node's highest edge weight is updated by not considering the edges that have been marked. The pseudocode of the GSM-CA algorithm is shown in Table

**Algorithm GSM-CA**

Input: a graph _{in }

Output: identified modules;

(1) Compute the edge weight

**For **each edge **do**

compute _{uv}

**End For**

(2) Form core

Select the edge

**If **_{uv }**then **exit;

**End If**

Initial core

**While **neighbour

is in _{ij }_{in }**do**

**End While**

(3) Select attachments for core

**For each **neighbor **do**

**If **

and **then**

**End If**

**End For each**

(4) Output results and update the highest edge weight

Output

(5) Repeat from step 2 to step 4, until reaching the termination condition of step 2.

Generation of overlapping clusters

In the above algorithm, once a cluster is generated, it is marked as clustered and is not considered further; the next cluster is generated in the remaining graph. Therefore, non-overlapping clusters are generated. To generate overlapping clusters, the existing non-overlapping clusters are extended by adding nodes to them from their neighbours in the original graph (including the marked nodes and edges). Then, in the original graph excluding the edges between nodes that have been marked as clustered, each node's highest edge weight is updated.

Greedy Search Method based on Fast Clustering (GSM-FC)

Many module detection approaches, including GSM-CA, are computationally expensive. The traditional hierarchical tree structure generated by these approaches cannot provide adequate information to identify which subtree belongs to a module structure. As a result, the module structure needs to be evaluated repeatedly based on the module definition. During the computation, the edge weights of neighbouring nodes need to be recomputed after each edge is deleted. The edge weight calculation is based on the shortest paths between vertices. Since the shortest path problem has high time complexity, these approaches do not scale even to medium-sized networks.

GSM-FC avoids repeated module structure evaluation because the module structure can be identified from the inherent network organization and the greedy algorithm. GSM-FC traverses all edges once and then generates the clusters. Moreover, GSM-FC utilizes properties of subnetworks, which reflect the network topology more effectively. As a result, the computational efficiency of the GSM-FC method is improved noticeably.

The GSM-FC algorithm is divided into three steps: 1) Input & initialization; 2) Cluster formation; 3) Output.

Input & initialization

The input to the algorithm is an undirected simple graph, so the adjacency matrix of the graph is read first. Each edge's weight is computed using formula (2). All vertices in the graph G are initialized as singleton clusters in the first step.

Cluster formation

During this step, all edges are traversed gradually and a greedy procedure is used to assemble the nodes into clusters. For one edge (u, v), if u and v are not in the same cluster, they are considered to be merged. If the edge weight _{uv }_{uv }_{uv }

Output

After all edges have been visited, the subnetworks generated during the cluster merging process are output. These subnetworks are considered to be modules. The pseudocode of the GSM-FC algorithm is shown in Table

**Algorithm GSM-FC**

Input: a graph

Output: identified modules;

(1)**For **each edge **do**

compute _{uv }; add _{q}

**End for**

(2)**While ****do**;

_{q};

**If ****then **//L is cluster label

**If **_{uv }_{u }_{uv }_{v }**then**

_{i}_{i}_{j}

**End if**

**End if**

**End while**
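The single-pass merging idea can be illustrated with a short Python sketch. It is not the authors' implementation, and two points are explicit assumptions: the edge weight is replaced by the plain common-neighbour count (formula (2) is not reproduced here), and two clusters are merged when the visited edge is a highest-weight edge of at least one of its endpoints (the exact merge condition is not recoverable from the pseudocode above). The function name `gsm_fc_sketch` and the toy graph are hypothetical.

```python
def gsm_fc_sketch(edges):
    """Greedy single-pass clustering sketch (assumed weight and merge rule)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Stand-in edge weight: number of common neighbours of the endpoints.
    weight = {(u, v): len(adj[u] & adj[v]) for u, v in edges}

    # Highest edge weight incident to each node.
    highest = {n: 0 for n in adj}
    for (u, v), w in weight.items():
        highest[u] = max(highest[u], w)
        highest[v] = max(highest[v], w)

    # Every node starts as a singleton cluster (union-find forest).
    parent = {n: n for n in adj}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    # Visit each edge exactly once.
    for (u, v), w in weight.items():
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # endpoints already share a cluster label
        # Assumed criterion: the edge is a highest-weight edge of u or of v.
        if w == highest[u] or w == highest[v]:
            parent[rv] = ru

    clusters = {}
    for n in adj:
        clusters.setdefault(find(n), set()).add(n)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])

# Hypothetical toy graph: two triangles joined by the bridge (3, 4).
toy = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
print(gsm_fc_sketch(toy))  # [[1, 2, 3], [4, 5, 6]]
```

The union-find forest keeps the cluster labels up to date as edges are visited, so the merging pass itself touches each edge once; under these assumptions the bridge between the two triangles has weight 0 and is never used for a merge.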

Efficiency analysis

Compared to the other algorithms, the advantage of the GSM-FC algorithm is its computational efficiency. The GSM-FC algorithm needs to visit each edge only once and requires no input parameters. The time complexity of the clustering process is linear in the number of edges. The edge weight calculation is the most time-consuming step of the clustering process. Let n and m denote the number of vertices and edges in a protein interaction network, respectively, and let k be the average number of neighbours of all the vertices, i.e. $k = 2m/n$.

Experimental setup and result analysis

Data set and the criterion of performance evaluation

In order to evaluate effectiveness of the new system, our algorithm is applied to the full DIP (the Database of Interacting Proteins)

The experimental results are based on a reference dataset of known yeast protein complexes retrieved from the MIPS

The overlapping score between a predicted complex $A$ and a known complex $B$ is $OS(A, B) = |A \cap B|^2 / (|A| \cdot |B|)$.

S-measure, as the harmonic mean of sensitivity and specificity, can be used to evaluate the overall performance of the different techniques.
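The two quantities used for benchmark matching can be sketched as follows. This is a minimal illustration under stated assumptions: the overlap score is taken to be the widely used form |A ∩ B|² / (|A| · |B|), sensitivity and specificity are passed in as already computed values because their exact definitions are not reproduced here, and the function names and protein identifiers are hypothetical.

```python
def overlap_score(predicted, known):
    """Assumed overlap score |A & B|^2 / (|A| * |B|) between two protein sets."""
    a, b = set(predicted), set(known)
    if not a or not b:
        return 0.0
    return len(a & b) ** 2 / (len(a) * len(b))

def s_measure(sensitivity, specificity):
    """Harmonic mean of sensitivity and specificity (0 if both are 0)."""
    if sensitivity + specificity == 0:
        return 0.0
    return 2 * sensitivity * specificity / (sensitivity + specificity)

# Hypothetical predicted and known complexes sharing three proteins.
predicted = {"P1", "P2", "P3", "P4"}
known = {"P2", "P3", "P4", "P5", "P6"}
print(overlap_score(predicted, known))  # 9 / (4 * 5) = 0.45
print(s_measure(0.5, 0.25))
```

Because the harmonic mean is dominated by the smaller of its two arguments, a method cannot obtain a high S-measure by maximizing sensitivity alone at the expense of specificity, or the reverse.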

P-values are used to evaluate the biological significance of our predicted complexes. A p-value represents the probability that proteins with a common function co-occur in a module by chance. A low p-value of a predicted complex generally indicates that the collective occurrence of these proteins in the module does not happen merely by chance and thus the module has high statistical significance. In our experiments, the p-values of complexes are calculated with SGD's GO::TermFinder tool.

Let the total number of proteins be $N$, of which $M$ share a particular annotation. The p-value of observing $m$ or more proteins that share the same annotation in a cluster of $n$ proteins, using the hypergeometric distribution, is defined as (6):

$$p = \sum_{i=m}^{\min(n, M)} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}} \qquad (6)$$
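The tail probability described above (m or more annotated proteins in a cluster of n, drawn from N proteins of which M carry the annotation) can be computed exactly with integer binomial coefficients. The sketch below is illustrative only; the function name `hypergeom_p_value` and the numbers in the usage line are assumptions, and the experiments themselves rely on GO::TermFinder rather than on this code.

```python
from math import comb

def hypergeom_p_value(N, M, n, m):
    """Return P(X >= m) for X ~ Hypergeometric(N, M, n).

    N: total number of proteins
    M: proteins carrying the annotation
    n: cluster size
    m: annotated proteins observed in the cluster
    """
    total = comb(N, n)
    upper = min(n, M)
    # Sum the exact counts first, then divide once to limit rounding error.
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, upper + 1))
    return tail / total

# Hypothetical small example: 10 proteins, 4 annotated, cluster of 3, 2 hits.
print(hypergeom_p_value(10, 4, 3, 2))  # (36 + 4) / 120 = 0.333...
```

Summing integers before the single division keeps the result accurate even for the very small p-values that arise with genome-scale values of N.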

The average f-measure is used to evaluate the overall significance of each algorithm. f-measure of an identified module is defined as a harmonic mean of its recall and precision

Where _{i }is a functional category mapped to module M. The proteins in functional category _{i }_{i }
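As an illustration of the f-measure of a single module, the sketch below assumes the commonly used definitions recall = |M ∩ F_i| / |F_i| and precision = |M ∩ F_i| / |M|, where M is the set of proteins in the module and F_i is the set of proteins in the functional category. These definitions, the function name `f_measure`, and the protein identifiers are assumptions made for the example.

```python
def f_measure(module, category):
    """Harmonic mean of recall and precision of a module for one category."""
    module, category = set(module), set(category)
    overlap = len(module & category)
    if overlap == 0:
        return 0.0
    recall = overlap / len(category)    # share of the category recovered
    precision = overlap / len(module)   # share of the module in the category
    return 2 * recall * precision / (recall + precision)

# Hypothetical module of 4 proteins and category of 6 proteins, 3 shared.
module = {"P1", "P2", "P3", "P4"}
category = {"P2", "P3", "P4", "P5", "P6", "P7"}
print(f_measure(module, category))  # recall 0.5, precision 0.75
```

An average f-measure over all identified modules would then be obtained by evaluating each module against its mapped functional category and averaging the results.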

Experimental results for GSM-CA method

Table

Results of various algorithms compared with MIPS complexes using DIP data

| **Algorithms** | **MCODE** | **CFinder** | **DPClus** | **COACH** | **GSM-CA** |
|---|---|---|---|---|---|
| #predicted complexes | 59 | 245 | 1143 | 745 | **353** |
| \|TP\| | 18 | 52 | 133 | 155 | **105** |
| \|TB\| | 19 | 61 | 144 | 106 | **119** |
| s-measure | 0.132 | 0.231 | 0.198 | 0.307 | **0.380** |

Comparison of the results before and after adding attachments is shown in Table

Comparison of the results before and after adding attachments

| | **Average Size** | **f-measure of BP** | **-log(p-value)** |
|---|---|---|---|
| Before | 5.29 | 0.356 | 7.2 |
| **After** | **7.37** | **0.362** | **8.6** |

Statistical significance of functional modules predicted by various methods

| **Algorithms** | **No. of Modules (size ≥ 3)** | **No. of Significant Modules** | **Average Size** | **Maximum Size** | **f-measure of BP** | **-log(p-value)** | **Parameters** |
|---|---|---|---|---|---|---|---|
| MCODE | 59 | 54 | 83.8 | 549 | 0.296 | 10.87 | fluff = 0.1; VWP = 0.2 |
| CFinder | 245 | 157 | 10.2 | 1409 | 0.246 | 4.49 | K = 3 |
| DPClus | 217 | 187 | 5.23 | 25 | 0.335 | 6.78 | Density = 0.7; $cp_{in}$ = 0.5 |
| COACH | 746 | 608 | 8.54 | 44 | 0.272 | 6.96 | None |
| **GSM-CA** | **187** | **168** | **7.37** | **79** | **0.362** | **8.6** | **$cn_{in}$ = 0.5** |
| **GSM-FC** | **113** | **106** | **9.65** | **118** | **0.359** | **10.46** | **None** |

Comparison of f-measure based on three types of GO of GSM-CA and other algorithms


Table

List of top ten scoring modules identified by GSM-CA and their most enriched GO terms for Biological Process

| **ID** | **Size of module** | **Number of proteins enriched in the same GO term** | **Size of GO term** | **Name of GO term** | **p-value** |
|---|---|---|---|---|---|
| 1 | 35 | 33 | 221 | rRNA processing | 1.70e-41 |
| 2 | 42 | 38 | 323 | ribosome biogenesis | 1.61e-39 |
| 3 | 13 | 13 | 15 | tRNA transcription | 1.92e-35 |
| 4 | 19 | 14 | 19 | mRNA polyadenylation | 1.96e-31 |
| 5 | 11 | 11 | 12 | cyclin catabolic process | 8.92e-31 |
| 6 | 14 | 12 | 14 | polyadenylation-dependent snoRNA 3'-end processing | 1.91e-30 |
| 7 | 19 | 18 | 93 | mitochondrial translation | 8.08e-30 |
| 8 | 16 | 14 | 29 | energy coupled proton transport, down electrochemical gradient | 1.33e-29 |
| 9 | 22 | 14 | 20 | RNA polymerase II transcriptional preinitiation complex assembly | 2.70e-29 |
| 10 | 19 | 18 | 101 | nuclear mRNA splicing, via spliceosome | 8.91e-29 |

Figure

An example of modules identified by the GSM-CA method


The GSM-CA method used the parameter _{in }_{in }_{in }_{in }_{in }_{in }_{in }_{in }

**The effects of $cn_{in}$ on clustering**. (a) The size of the biggest cluster. (b) The total number of clusters whose size is greater than 2. (c) The average size of the clusters whose size is greater than 2. (d) The average f-measure.

Experimental results for the GSM-FC method

Table

Comparison of the running time of the GSM-FC algorithm and other algorithms

| **Algorithms** | **Running time** |
|---|---|
| MCODE | 71.5 s |
| COACH | 6.8 s |
| CFinder | 24.4 s |
| DPClus | 926.0 s |
| GSM-CA | 82.6 s |
| **GSM-FC** | **3.4 s** |

Table

Conclusions

Identification of functional modules is crucial to understanding the structural and functional properties of protein interaction networks, and the increasing amount of protein interaction data has made their detection feasible. In this paper, a Greedy Search Method based on Core-Attachment structure, called GSM-CA, is proposed to mine functional modules from protein interaction networks. Because core and peripheral proteins may have different roles and properties due to their different topological characteristics, the GSM-CA method defines an edge weight and two criteria for determining core nodes and attachment nodes. It first generates the core of a module and then forms the module by including attachments into the core. The GSM-CA method is applied to the PPI network of S. cerevisiae. The MIPS benchmark and the GO annotation are used to validate the identified modules and to compare the performance of our algorithm with several other algorithms, including MCODE, CFinder, DPClus, and COACH. The evaluation and analysis show that most of the functional modules predicted by our algorithm have high functional similarity and match well with the benchmark. The quantitative comparisons reveal that our algorithm outperforms the other competing algorithms.

Many module detection approaches utilize traditional hierarchical clustering methods, which are computationally costly because the tree structure they produce cannot provide adequate information to identify which subtree corresponds to a module. To overcome this problem, the Greedy Search Method based on Fast Clustering (GSM-FC) is proposed. The GSM-FC method takes advantage of the greedy search procedure to separate the network into a suitable set of modules. The experimental results show that the GSM-FC method significantly reduces the computational time while maintaining high prediction accuracy compared to GSM-CA.

In future work, the algorithm should be extended to weighted graphs. How to incorporate diverse biological information into the exploratory analysis of protein complexes in PPI networks is another interesting question for further research.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JH supervised the work, and JH, BY and WZ contributed to the problem formulation and paper writing. JH and CL conducted research on the algorithms of GSM-CA and GSM-FC, and CL developed and implemented the algorithms. The manuscript was drafted by JH and CL. All authors read and approved the final manuscript.

Acknowledgements

This article has been published as part of

The authors would like to thank Bader G. and Hogue C. for sharing the MCODE tool, and Adamcsek B., Palla G., Farkas I., Derenyi I., and Vicsek T. for making CFinder publicly available. The authors are also thankful to Altaf-Ul-Amin Md., Shinbo Y., Mihara K., Kurokawa K., and Kanaya S. for kindly sharing the DPClus tool, and to Wu M., Li X., Kwoh C.-K. and Ng S.-K. for sharing the source code of COACH. The authors also thank the anonymous reviewers for their helpful and constructive suggestions.

This research work is supported by State Key Laboratory for Novel Software Technology of Nanjing University (KFKT2010B03) and Open Research Foundation of Key Laboratory for Computer Network and Information Integration, Southeast University (K93-9-2010-19).