Abstract
Background
Most computational algorithms focus on detecting highly connected subgraphs in PPI networks as protein complexes but ignore their inherent organization. Furthermore, many of these algorithms are computationally expensive. However, recent analysis indicates that experimentally detected protein complexes generally contain Core/attachment structures.
Methods
In this paper, a Greedy Search Method based on Core-Attachment structure (GSMCA) is proposed. The GSMCA method detects densely connected regions in large protein-protein interaction networks based on the edge weight and two criteria for determining core nodes and attachment nodes. The GSMCA method improves the prediction accuracy compared to other similar module detection approaches; however, it is computationally expensive. Many module detection approaches are based on traditional hierarchical methods, which are also computationally inefficient because the hierarchical tree structure produced by these approaches cannot provide adequate information to identify whether a network belongs to a module structure or not. In order to speed up the computational process, the Greedy Search Method based on Fast Clustering (GSMFC) is proposed in this work. The edge-weight-based GSMFC method uses a greedy procedure to traverse all edges just once to separate the network into a suitable set of modules.
Results
The proposed methods are applied to the protein interaction network of S. cerevisiae. Experimental results indicate that many significant functional modules are detected, most of which match the known complexes. Results also demonstrate that the GSMFC algorithm is faster and more accurate than other competing algorithms.
Conclusions
Based on the new edge weight definition, the proposed algorithm takes advantage of the greedy search procedure to separate the network into a suitable set of modules. Experimental analysis shows that the identified modules are statistically significant. The algorithm can reduce the computational time significantly while keeping high prediction accuracy.
Background
With the rapid development of technologies to predict protein interactions, huge data sets portrayed as networks have become available. Most real networks typically contain parts in which the nodes are more highly connected to each other than to the rest of the network. The sets of such nodes are usually called clusters, communities, or modules [1-4]. The presence of biologically relevant functional modules in Protein-Protein Interaction (PPI) graphs has been confirmed by many researchers [4,5]. Identification of functional modules is crucial to the understanding of the structural and functional properties of networks [6,7]. There is a major distinction between two biological concepts, namely, protein complexes and functional modules [7]. A protein complex is a physical aggregation of several proteins (and possibly other molecules) via molecular interaction (binding) with each other at the same location and time. A functional module also consists of a number of proteins (and other molecules) that interact with each other to control or perform a particular cellular function. Unlike protein complexes, proteins in a functional module do not necessarily interact at the same time and location. In this paper, we do not distinguish protein complexes from functional modules because the protein interaction data used for detecting protein complexes in this work do not provide temporal and spatial information.
Recently, many research works have addressed the problem of clustering protein interaction networks [8-10]. Some of them use graph-based clustering methods for mining functional modules [11,17-20]. These studies are mainly based on the observation that densely connected regions in PPI networks often correspond to actual protein functional modules. In short, the methods proposed in these studies detect densely connected regions of a graph that are separated by sparse regions. Some graph clustering approaches that use PPI networks for mining functional modules are introduced in the following. Bader and Hogue [17] proposed the Molecular COmplex Detection (MCODE) algorithm, which utilizes connectivity values in protein interaction graphs to mine for protein complexes. The algorithm first computes the weight of each vertex from its neighbour density and then traverses outward from a seed protein with a high weight to recursively include neighbouring vertices whose weights are above a given threshold. However, since the highly weighted vertices may not be highly connected to each other, the algorithm does not guarantee that the discovered regions are dense. A simultaneous protein interaction network (SPIN) introduced by Jung et al. [12] specifies mutually exclusive interactions (MEIs). Taking advantage of SPINs, SPIN_MCODE has outperformed the plain MCODE method.
Amin et al. [18] proposed a cluster periphery tracking algorithm (DPClus) to detect protein complexes by keeping track of the periphery of a detected cluster. DPClus first weighs each edge based on the common neighbours between two proteins and then weighs each node by its weighted degree. To form a protein complex, DPClus first selects the seed node having the highest weight as the initial cluster and then iteratively augments this cluster by including, one by one, vertices that lie outside the current cluster but are closely related to it. Li et al. [13] modified the DPClus algorithm to identify protein complexes that have a small diameter (or a small average vertex distance) and satisfy a different cluster connectivity-density property. The performance of such algorithms depends heavily on the quality of the seeds and the criterion for extending clusters.
Adamcsek et al. [19] provided a software tool called CFinder to find functional modules in PPI networks. CFinder detects the k-clique percolation clusters as functional modules using a Clique Percolation Method (CPM) [20]. In particular, a k-clique is a clique with k nodes, and two k-cliques are adjacent if they share (k - 1) common nodes. A k-clique percolation cluster is then constructed by linking all the adjacent k-cliques into a bigger subgraph. Li et al. [14] proposed a new clustering algorithm called IPCMCE, which identifies protein complexes based on maximal cliques and then extends all the maximal cliques by adding their neighbourhoods iteratively. Liu et al. [15] developed an algorithm called Clustering based on Maximal Cliques (CMC) to discover complexes from the weighted PPI network. CMC first finds maximal cliques in PPI networks, and then removes or merges highly overlapped maximal cliques based on their interconnectivity. However, in the unweighted PPI network, CMC generates fewer significant functional modules with a P-value less than 1E-5 than the DPClus algorithm [11]. Wang et al. [16] also developed an algorithm called CPDR, based on a new topological model, for identifying protein complexes. Wang's algorithm extended the definition of the k-clique community of the CPM algorithm and introduced a distance restriction.
The above computational studies mainly focus on detecting highly connected subgraphs in PPI networks as protein complexes but ignore their inherent organization. However, recent analysis indicates that experimentally detected protein complexes generally contain Core/attachment structures. Protein complexes often include cores in which proteins are highly co-expressed and share high functional similarity. Core proteins are usually more highly connected to each other and may have higher essentiality and lower evolutionary rates than peripheral proteins [26]. A protein complex core is often surrounded by some attachments, which assist the core in performing subordinate functions. Gavin et al.'s work [28] also demonstrates a similar architecture and modularity for protein complexes. Therefore, protein complexes have an inherent core-attachment organization [26,27,29]. To provide insights into the inherent organization of protein complexes, some methods [21,26,29] have been proposed to detect protein complexes in two stages. In the first stage, protein complex cores, as the heart of the protein complexes, are detected. In the second stage, protein complexes are expanded by incorporating attachments into the protein complex cores. Wu et al. [21] presented a COre-AttaCHment based method (COACH), and Leung et al. also developed an approach called CoreMethod. These approaches detect protein complexes in PPI networks by identifying their cores and attachments separately [29]. To detect cores, COACH performs a local search within each vertex's neighbourhood graph, while CoreMethod [29] computes the p-values between all the proteins in the whole PPI network.
In this paper, a Greedy Search Method based on Core-Attachment structure called GSMCA is introduced. Compared with the other core-attachment methods, the GSMCA method proposes a new edge weight calculation and new evaluation criteria for judging whether a node is a core node or an attachment node. The GSMCA method uses a pure greedy procedure to move a node between two different sets. The detected clusters are also core-attachment structures. In particular, GSMCA first defines the seed edges of the core from the neighbourhood graphs based on the highest weight and then detects protein-complex cores as the hearts of protein complexes. Finally, GSMCA includes attachments into these cores to form biologically meaningful structures. The new algorithm is applied to the protein interaction network of S. cerevisiae. The modules identified by the new algorithm are mapped to the MIPS [22] benchmark complexes and validated by GO [23] annotations. The experimental results show that the identified modules are statistically significant. In terms of prediction accuracy, the GSMCA method outperforms several other competing algorithms. Moreover, most of the previous methods generate separate subgraphs and therefore cannot detect overlapping functional modules, whereas GSMCA can generate both non-overlapping and overlapping clusters.
The GSMCA method achieves high accuracy. However, it is computationally expensive. Many module detection approaches are based on traditional hierarchical methods, which are also computationally inefficient because the hierarchical tree structure generated by the repeated computational process cannot provide adequate information to identify whether a network belongs to a module structure or not. To speed up the computational process of these module detection approaches, the Greedy Search Method based on Fast Clustering (GSMFC) is proposed in this paper. The edge-weight-based GSMFC method uses a greedy procedure to traverse all edges just once to separate the network into a suitable set of modules. The experimental results demonstrate that the newly proposed algorithm can reduce the computational time noticeably while maintaining high prediction accuracy compared to GSMCA.
The outline of this paper is as follows. In Section 2, the implementation of our two methods is described in detail. In Section 3, our algorithms are applied to the protein interaction network of the yeast S. cerevisiae and the results are analyzed. In Section 4, the conclusions are given.
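The k-clique adjacency rule described above for CPM can be illustrated with a small brute-force sketch. This is not CFinder's implementation; the function name `k_clique_communities` and the toy edge list are hypothetical, and the exhaustive enumeration is only practical for very small graphs.

```python
from itertools import combinations

def k_clique_communities(edges, k):
    """Brute-force clique percolation on a small undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Enumerate every k-clique by testing all k-subsets of nodes.
    cliques = [frozenset(c) for c in combinations(sorted(adj), k)
               if all(b in adj[a] for a, b in combinations(c, 2))]
    # Union-find over cliques: two k-cliques are adjacent if they share k - 1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    # Each percolation cluster is the union of the nodes of linked cliques.
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())

# Two triangles sharing an edge, plus a separate triangle reached by a bridge.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6), (5, 7), (6, 7)]
communities = k_clique_communities(edges, 3)
print(communities)  # [[1, 2, 3, 4], [5, 6, 7]]
```

The bridge edge (4, 5) belongs to no triangle, so it does not link the two clusters when k = 3.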
Methods
Definitions
Protein interaction networks can be represented as an undirected graph G = (V, E), where V is the set of vertices and E = {(u, v) | u, v ∈ V} is the set of edges between the vertices. For a node v ∈ V, the set of v's direct neighbours is denoted as N_{v} and is defined as N_{v} = {u | u ∈ V, (u, v) ∈ E}. Before introducing the details of the algorithm, some terms used in this paper are defined.
The closeness cn_{nk }of any node n with respect to some node k in cluster c is defined by (1).
Here, NC_{n }is the set of n's direct neighbours in cluster c, and NC_{k }is the set of k's direct neighbours in cluster c.
The DPClus algorithm defines the weight w_{uv }of an edge (u,v) ∈ E as the number of the common neighbours of the nodes u and v. It is likely that two nodes that belong to the same cluster have more common neighbours than two nodes that do not. For two edges having the same number of common neighbours, the one that has more interactions between the common neighbours is more likely to belong to the same cluster.
Therefore, the definition of w_{uv }is modified in the paper by (2)
Here N_{uv} = N_{u} ∩ N_{v}, E_{uv} = {(v_{j}, v_{k}) | (v_{j}, v_{k}) ∈ E, v_{j}, v_{k} ∈ N_{uv}}, and α is the interaction factor that indicates how important the interactions are. The default value of α is 1.
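As a minimal sketch of how such a weight could be computed, the code below assumes that formula (2) has the additive form w_{uv} = |N_{uv}| + α|E_{uv}|. That form is an assumption made here for illustration, since the formula itself is not reproduced in this text; the function name `edge_weights` and the toy graph are likewise hypothetical.

```python
def edge_weights(edges, alpha=1.0):
    """Weight each edge of a simple undirected graph.

    ASSUMPTION: w_uv = |N_uv| + alpha * |E_uv| (additive form, not confirmed
    by the text), where N_uv is the set of common neighbours of u and v and
    E_uv is the set of edges among those common neighbours.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    weights = {}
    for u, v in edges:
        common = adj[u] & adj[v]  # N_uv
        # Each edge inside N_uv is seen from both endpoints, hence the halving.
        links = sum(len(adj[a] & common) for a in common) // 2  # |E_uv|
        weights[frozenset((u, v))] = len(common) + alpha * links
    return weights

# A 4-clique {a, b, c, d} with a pendant node e attached to d.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"),
             ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
w = edge_weights(toy_edges)
print(w[frozenset(("a", "b"))])  # 3.0: two common neighbours plus one edge (c, d)
print(w[frozenset(("d", "e"))])  # 0.0: no common neighbours
```

Setting `alpha=0` in this sketch reduces the weight to the plain common-neighbour count used by DPClus.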
The number of common neighbours between any two nodes is equal to the number of paths of length 2 between them. This definition of weight is used to cluster graphs that have densely connected regions separated by sparse regions. In relatively sparse graphs, paths of length 3 or 4 between the two nodes can also be considered.
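The equivalence between common neighbours and paths of length 2 can be checked on a toy graph by squaring its adjacency matrix. The helper name `paths_of_length_two` and the 4-node example are hypothetical and serve only as an illustration.

```python
def paths_of_length_two(adj_matrix):
    """Return A*A for a 0/1 adjacency matrix A given as a list of lists."""
    n = len(adj_matrix)
    return [[sum(adj_matrix[i][k] * adj_matrix[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)]

# Nodes 0..3 with edges 0-1, 0-2, 1-2, 1-3, 2-3.
A = [[0, 1, 1, 0],
     [1, 0, 1, 1],
     [1, 1, 0, 1],
     [0, 1, 1, 0]]
A2 = paths_of_length_two(A)

# Entry (i, j) counts the nodes k adjacent to both i and j.
print(A2[0][3])  # 2: nodes 1 and 2 are common neighbours of 0 and 3
print(A2[0][1])  # 1: node 2 is the only common neighbour of 0 and 1
```

Diagonal entries of the squared matrix equal node degrees and are not used as edge weights.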
The highest edge weight of a node n is defined as hw_{n }= max (w_{nu}) for all u such that (n, u) ∈ E. The highest weight edge (n, v) of node n is the edge satisfying the condition that w_{nv }= hw_{n}.
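A short sketch of this definition follows. It assumes the edge weights are already available as a dictionary keyed by (u, v) tuples; the function name `highest_edge_weights` and the example weights are hypothetical, and ties are resolved in favour of the first edge encountered, which the text does not specify.

```python
def highest_edge_weights(weights):
    """Map each node n to (hw_n, partner), where (n, partner) attains hw_n.

    weights: dict mapping (u, v) tuples to w_uv for an undirected graph.
    """
    best = {}
    for (u, v), w in weights.items():
        for node, other in ((u, v), (v, u)):
            # Keep the first edge seen on ties (an arbitrary choice).
            if node not in best or w > best[node][0]:
                best[node] = (w, other)
    return best

example_weights = {("a", "b"): 3, ("a", "c"): 1, ("c", "d"): 2}
hw = highest_edge_weights(example_weights)
print(hw["a"])  # (3, 'b')
print(hw["c"])  # (2, 'd')
```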
Greedy Search Method based on Core Attachment structure (GSMCA)
Because core and peripheral proteins may have different roles and properties due to their different topological characteristics, a Greedy Search Method based on Core-Attachment structure called GSMCA is proposed, based on the definition of the edge weight and two evaluation criteria for judging whether a node is a core node or an attachment node. GSMCA uses a greedy procedure to obtain a suitable set of clusters. It first generates the core of a cluster, and then selects reliable attachments cooperating with the core to form the final cluster. The algorithm is divided into six steps: 1) Input & initialization; 2) Termination check; 3) Seed selection; 4) Core formation; 5) Attachments selection; 6) Output & update. The functional modules are determined by the final clusters. The GSMCA algorithm is described in full below.
Input & initialization
The input to the algorithm is an undirected simple graph, and hence the adjacency matrix of the graph is read first. The user must choose the minimum value for closeness in cluster formation. This minimum value will be referred to as cn_{in}. Each edge's weight is computed based on formula (2). It is computed just once and is not recalculated in the following steps.
Termination check
Once a cluster is generated, it is removed from the graph. The next cluster is then formed in the remaining graph and the process goes on until no seed edge whose weight is above one (i.e. w_{uv }> 1) can be found in the remaining graph.
Seed selection
Each cluster starts at a deterministic edge called the seed edge. The highest weight edge (n, v) of node n satisfying the condition that w_{nv }=hw_{n }is considered as the seed edge in the remaining graph.
Core formation
A protein complex core is a small group of proteins that show high co-expression patterns and share a high degree of functional similarity. It is the key functional unit of the complex and largely determines the cellular role and essentiality of the complex [21,26-28]. For example, a protein in a core often has many interacting partners, and protein complex cores often correspond to small, dense and reliable subgraphs in PPI networks [28].
The core starts from a single edge and then grows gradually by adding nodes one by one from its neighbours. The neighbours of a core are the nodes that are connected to any node of the core but are not part of the core. The core is referred to as C. For a neighbour u of C, if u's neighbour v linked by u's highest weight edge (u, v) is in C, u is a candidate for inclusion in the core. Before including u in C, the condition cn_{uv} ≥ cn_{in} is checked, and among the qualifying neighbours the one whose highest edge weight is largest is included. This process goes on until no such neighbour can be found, at which point the core of one cluster has been generated.
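The growth loop described above can be sketched as follows. Because the closeness formula (1) is not reproduced in this text, the sketch takes closeness as a caller-supplied function; the stand-in `fraction_linked` used in the example is purely hypothetical and is not the paper's definition. The names `grow_core`, `adj` and `weights` are also illustrative, and ties are broken by node name.

```python
def grow_core(adj, weights, seed, closeness, cn_in):
    """Grow a core from a seed edge.

    adj: node -> set of neighbours; weights: frozenset({u, v}) -> w_uv.
    closeness(u, v, core) is supplied by the caller (formula (1) is not
    reproduced here, so any concrete choice is an assumption).
    """
    def best_partner(u):
        # Endpoint of u's highest weight edge; sorted() makes ties deterministic.
        return max(sorted(adj[u]), key=lambda x: weights[frozenset((u, x))])

    core = set(seed)
    while True:
        candidates = []
        for u in set().union(*(adj[c] for c in core)) - core:
            v = best_partner(u)
            if v in core and closeness(u, v, core) >= cn_in:
                candidates.append((weights[frozenset((u, v))], u))
        if not candidates:
            return core
        core.add(max(candidates)[1])  # neighbour with the largest hw_u joins

# Toy graph: a 4-clique {a, b, c, d} with a tail d - e - f.
toy = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
       ("c", "d"), ("d", "e"), ("e", "f")]
adj = {}
for u, v in toy:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
# Hand-assigned weights: 3 inside the clique, 0 on the tail edges.
weights = {frozenset(e): (0 if "e" in e else 3) for e in toy}

def fraction_linked(u, v, core):
    # HYPOTHETICAL stand-in for closeness: share of core nodes adjacent to u.
    return len(adj[u] & core) / len(core)

core = grow_core(adj, weights, ("a", "b"), fraction_linked, cn_in=0.5)
print(sorted(core))  # ['a', 'b', 'c', 'd']
```

In this example node e is rejected because it touches only one of the core nodes, so its stand-in closeness stays below the 0.5 threshold.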
Attachments selection
After the core of one cluster has been detected, the peripheral information of each core is extracted and reliable attachments cooperating with it are selected to form the final cluster. For each neighbour u of the core C, if u's neighbour v linked by u's highest weight edge (u, v) is in C, is computed. V_{uv }is the common neighbours of u and v in the core C. N_{uv }is the common neighbours of u and v in graph G.
If , u will be selected as an attachment. After all neighbours of the core are checked, the final cluster is generated.
Output & update
Once a cluster is generated, graph G is updated by removing the present cluster. The nodes belonging to the present cluster and the incident edges on these nodes are marked as clustered and not considered in the following. Then in the remaining graph, each node's highest edge weight is updated by not considering the edges that have been marked. The pseudocode of the GSMCA algorithm is shown in Table 1.
Generation of overlapping clusters
In the above algorithm, once a cluster is generated, it is marked as clustered and is not considered further, and the next cluster is generated in the remaining graph. Therefore, non-overlapping clusters are generated. In order to generate overlapping clusters, the existing non-overlapping clusters are extended by adding nodes to them from their neighbours in the original graph (including the marked nodes and edges). Then, in the original graph excluding the edges between the nodes that have been marked as clustered, each node's highest edge weight is updated.
Greedy Search Method based on Fast Clustering (GSMFC)
Many module detection approaches, including GSMCA, are computationally expensive. The traditional hierarchical tree structure generated by these approaches cannot provide adequate information to identify which subtree belongs to a module structure. As a result, the module structure needs to be evaluated repeatedly based on the module definition. During the computational process, the edge weights of neighbouring nodes need to be recomputed after an edge is deleted. The edge weight calculation is based on the shortest paths between vertices. Since the shortest path problem has high time complexity, these approaches do not scale even to medium-sized networks.
The GSMFC can avoid repeated module structure evaluation because the module structure can be identified based on inherent network organization and the greedy algorithm. The GSMFC traverses all edges once then generates the clusters. Moreover, the GSMFC utilizes properties of subnetworks, which can reflect the network topology more effectively. As a result, the computational efficiency of the GSMFC method can be improved noticeably.
The GSMFC algorithm is divided into three steps: 1) Input & initialization; 2) Cluster formation; 3) Output.
Input & initialization
The input to the algorithm is an undirected simple graph, and hence the adjacency matrix of the graph is read first. Each edge's weight is computed based on formula (2). All vertices in the graph G are initialized as singleton clusters in this first step.
Cluster formation
During this step, all edges are traversed one by one and a greedy procedure is used to assemble the nodes into clusters. For an edge (u, v), if u and v are not in the same cluster, they are considered for merging. If the edge weight w_{uv} is u's highest edge weight, an edge from u to v is added in order to merge the cluster containing u into the cluster containing v. Similarly, if the edge weight w_{uv} is v's highest edge weight, an edge from v to u is added in order to merge the cluster containing v into the cluster containing u. If the edge weight w_{uv} is neither u's highest edge weight nor v's highest edge weight, the edge (u, v) is ignored and the next edge is evaluated.
Output
After all edges have been visited, the subnetworks generated during the cluster merging process are output. These subnetworks are considered as modules. The pseudocode of the GSMFC algorithm is shown in Table 2.
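The three steps above can be sketched with a union-find structure, which is one possible way to track cluster membership and is an implementation choice made here, not something stated in the text. The function name `gsmfc_clusters`, the tuple-keyed `weights` dictionary and the toy weights are hypothetical; the edge weights are taken as given rather than computed from formula (2).

```python
def gsmfc_clusters(nodes, weights):
    """Single-pass greedy merging; weights maps (u, v) tuples to w_uv."""
    # Highest edge weight of every node.
    hw = {n: float("-inf") for n in nodes}
    for (u, v), w in weights.items():
        hw[u] = max(hw[u], w)
        hw[v] = max(hw[v], w)

    parent = {n: n for n in nodes}  # every node starts as a singleton cluster

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (u, v), w in weights.items():  # each edge is visited exactly once
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # already in the same cluster
        if w == hw[u]:
            parent[ru] = rv  # merge u's cluster into v's cluster
        elif w == hw[v]:
            parent[rv] = ru  # merge v's cluster into u's cluster
        # otherwise the edge is ignored

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())

# Two triangles joined by a bridge (c, d) that has no common neighbours.
toy_weights = {("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 1, ("c", "d"): 0,
               ("d", "e"): 1, ("d", "f"): 1, ("e", "f"): 1}
modules = gsmfc_clusters(list("abcdef"), toy_weights)
print(modules)  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```

The bridge is skipped because its weight is below the highest edge weight of both endpoints, so the two triangles remain separate modules.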
Efficiency analysis
Compared to the other algorithms, the advantage of the GSMFC algorithm is its computational efficiency. The GSMFC algorithm needs to visit all edges only once and requires no input parameters. The time complexity of the clustering process is linear in the number of edges. The edge weight calculation is the most time-consuming step of the whole procedure. Let n and m denote the number of vertices and edges in a protein interaction network, respectively, and let k be the average number of neighbours of all the vertices, i.e. k = 2m/n. Then, the complexity of calculating all the edge weights is O(k^{2}m). Since the time complexity of the hierarchical merging process is O(m), the total time complexity of the GSMFC algorithm is O(k^{2}m). In general, k is much smaller than the number of vertices n and can be considered a constant, because it is well known that the protein interaction network is scale-free, in which most proteins only participate in a small number of interactions [31].
Experimental setup and result analysis
Data set and the criterion of performance evaluation
In order to evaluate the effectiveness of the new methods, our algorithm is applied to the full DIP (Database of Interacting Proteins) [24] yeast dataset, which consists of 17201 interactions among 4930 proteins [21]. It is more complex and difficult to identify the modules using the full dataset than using the core dataset. The performance of our method is compared with several competing algorithms including MCODE, CFinder, DPClus, and COACH. The parameter selection for these algorithms is based on the authors' recommendations. Several metrics, including f-measure and p-value, are used for rigorous performance evaluation.
The experimental results are based on a reference dataset of known yeast protein complexes retrieved from MIPS [22]. While it is probably one of the most comprehensive public datasets of yeast complexes available to date, it is by no means a complete dataset; there are still many yeast complexes that need to be discovered. After filtering out the predicted protein complexes and the complexes composed of a single protein from the dataset, a final set of 214 yeast complexes is used as our evaluation benchmark.
The overlapping score [17] between a predicted complex and a real complex in the benchmark, OS(p, b) = i^{2}/(p*b), is used to determine whether these complexes match each other, where i is the size of the intersection of a predicted complex with a known complex, p is the size of the predicted complex and b is the size of the known complex. If OS(p, b) ≥ ω, they are considered to be matching (ω is set to 0.20, the value adopted in the MCODE paper [17]). We assume that P is the set of complexes predicted by a computational method and B is the set of target complexes in the benchmark. The set of True Positives (TP) is defined as TP = {p | p ∈ P, ∃b ∈ B, OS(p, b) ≥ ω}, while the set of False Negatives (FN) is defined as FN = {b | b ∈ B, ∀p ∈ P (OS(p, b) < ω)}. The set of False Positives (FP) is FP = P - TP, while the set of known benchmark complexes matched by predicted complexes (TB) is TB = B - FN. The sensitivity and specificity [17] are defined as:
S-measure, as the harmonic mean of sensitivity and specificity, can be used to evaluate the overall performance of the different techniques.
P-values are used to evaluate the biological significance of our predicted complexes. P-values represent the probability of co-occurrence of proteins with common functions. A low p-value of a predicted complex generally indicates that the collective occurrence of these proteins in the module does not happen merely by chance and thus the module has high statistical significance. In our experiments, the p-values of complexes are calculated by the tool called SGD's GO::TermFinder [23]. SGD's GO::TermFinder uses all three types of ontology: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC). The cutoff of the p-value is set to 0.01. The average -log(p-value) of all modules is calculated by mapping each module to the annotation with the lowest p-value.
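The matching procedure described above can be sketched as follows. The function names `overlap_score`, `match_sets` and `harmonic_mean`, as well as the toy complexes, are hypothetical. The sensitivity and specificity formulas are not reproduced in this text, so the sketch only builds the four sets and shows the harmonic mean on two arbitrary illustrative values.

```python
def overlap_score(pred, bench):
    """OS(p, b) = i^2 / (|p| * |b|), with i the size of the intersection."""
    i = len(pred & bench)
    return i * i / (len(pred) * len(bench))

def match_sets(predicted, benchmark, omega=0.2):
    """Return (TP, FP, FN, TB) as lists of complexes (sets of proteins)."""
    tp = [p for p in predicted
          if any(overlap_score(p, b) >= omega for b in benchmark)]
    fn = [b for b in benchmark
          if all(overlap_score(p, b) < omega for p in predicted)]
    fp = [p for p in predicted if p not in tp]   # FP = P - TP
    tb = [b for b in benchmark if b not in fn]   # TB = B - FN
    return tp, fp, fn, tb

def harmonic_mean(x, y):
    """Harmonic mean of two rates; returns 0.0 when both are zero."""
    return 2 * x * y / (x + y) if x + y > 0 else 0.0

predicted = [{1, 2, 3, 4}, {7, 8}]
benchmark = [{1, 2, 3}, {5, 6}]
tp, fp, fn, tb = match_sets(predicted, benchmark)
print(overlap_score({1, 2, 3, 4}, {1, 2, 3}))  # 0.75
print(tp, fp, fn, tb)  # [{1, 2, 3, 4}] [{7, 8}] [{5, 6}] [{1, 2, 3}]
print(harmonic_mean(0.5, 0.5))  # 0.5
```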
Let the total number of proteins be N, with a total of M proteins sharing a particular annotation. The p-value of observing m or more proteins that share the same annotation in a cluster of n proteins, using the hypergeometric distribution, is defined as (6):

\[
p\text{-value} = \sum_{i=m}^{\min(n, M)} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}} \qquad (6)
\]
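The tail probability described above can be computed directly with exact binomial coefficients. This is a minimal sketch using only the Python standard library; the function name `hypergeom_pvalue` is hypothetical, and the argument names follow the symbols N, M, n and m used in the text.

```python
from math import comb

def hypergeom_pvalue(N, M, n, m):
    """Probability of drawing at least m annotated proteins in a cluster of n.

    N: total proteins; M: proteins carrying the annotation;
    n: cluster size; m: annotated proteins observed in the cluster.
    """
    upper = min(n, M)
    # comb(N - M, n - i) is 0 whenever n - i exceeds N - M, as required.
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, upper + 1))
    return tail / comb(N, n)

# 10 proteins, 4 annotated; a cluster of 3 containing at least 2 annotated ones.
p = hypergeom_pvalue(10, 4, 3, 2)
print(round(p, 4))  # 0.3333
```

For real genome-scale inputs, a log-space or library implementation may be preferable for speed, but the exact integer arithmetic here avoids rounding problems on small examples.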
The average f-measure is used to evaluate the overall significance of each algorithm. The f-measure of an identified module is defined as the harmonic mean of its recall and precision [25]:

\[
\text{f-measure} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}, \qquad
\text{Recall} = \frac{|M \cap F_{i}|}{|F_{i}|}, \qquad
\text{Precision} = \frac{|M \cap F_{i}|}{|M|}
\]

where F_{i} is a functional category mapped to module M. The proteins in functional category F_{i} are considered as true predictions, the proteins in module M are considered as positive predictions, and the common proteins of F_{i} and M are considered as true positive predictions. Recall is the fraction of the true positive predictions out of all the true predictions, and precision is the fraction of the true positive predictions out of all the positive predictions [25]. The average f-measure value of all modules is calculated by mapping each module to the function with the highest f-measure value.
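The recall, precision and f-measure described above translate directly into a few lines of code. The function names `f_measure` and `best_f_measure` and the toy protein sets are hypothetical; returning 0.0 for an empty intersection is a convention chosen here to avoid division by zero.

```python
def f_measure(module, category):
    """Harmonic mean of recall and precision for one module/category pair."""
    true_positive = len(module & category)
    if true_positive == 0:
        return 0.0  # convention chosen here for disjoint sets
    recall = true_positive / len(category)   # share of the category recovered
    precision = true_positive / len(module)  # share of the module that is correct
    return 2 * recall * precision / (recall + precision)

def best_f_measure(module, categories):
    """Map a module to the functional category with the highest f-measure."""
    return max(f_measure(module, c) for c in categories.values())

module = {"a", "b", "c", "d"}
categories = {"F1": {"a", "b", "c", "e", "f", "g"}, "F2": {"d", "x"}}
score = f_measure(module, categories["F1"])
best = best_f_measure(module, categories)
print(round(score, 3))  # 0.6
```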
Experimental results for GSMCA method
Table 3 compares the results obtained by several popular methods with the MIPS benchmark complexes. Table 3 indicates that the number of correctly predicted complexes using MCODE, CFinder, DPClus and GSMCA is less than the number of benchmark complexes matched by predicted complexes. For COACH the opposite holds: because COACH detects clusters starting from each node, the overlapping rate is high. Although the redundancy-filtering procedure is used, some predicted complexes are still similar and match the same benchmark complex. Table 3 indicates that the S-measure of COACH (0.307) is higher than those of MCODE, CFinder and DPClus. The S-measure of GSMCA (0.380) is significantly higher than that of COACH. In addition, the overall performance of COACH is much better than that of CoreMethod [21], which is another approach based on the core-attachment structure.
Table 3. Results of various algorithms compared with MIPS complexes using DIP data
A comparison of the results before and after adding attachments is shown in Table 4. The comparison shows that after adding attachments, the average size of modules grows from 5.29 to 7.37. Moreover, the f-measure of BP and the -log(p-value) improve noticeably after adding attachments. All of these indicate that protein complexes indeed contain Core/attachment structures. Comparisons of the biological significance of modules predicted by several algorithms are shown in Table 5. MCODE is not considered since it generates only a small number of modules. Table 5 indicates that the proportion of significant modules predicted by GSMCA is the highest and that the -log(p-value) of GSMCA is also higher than those of the other algorithms. Among the other methods, DPClus has the highest average f-measure (0.335); the average f-measure of GSMCA is 0.362, which is higher still. The detailed comparison of f-measure based on all three types of Gene Ontology (GO) terms, namely Biological Process, Molecular Function, and Cellular Component, is shown in Figure 1. Figure 1 indicates that the average f-measure of GSMCA for Cellular Component (0.453) is also the highest among all the methods.
Table 4. Comparison of the results before and after adding attachments
Table 5. Statistical significance of functional modules predicted by various methods
Figure 1. Comparison of f-measure based on three types of GO of GSMCA and other algorithms.
Table 6 lists the top 10 most significant modules identified by the GSMCA method. They are sorted in increasing order of p-value.
Table 6. List of top ten scoring modules identified by GSMCA and their most enriched GO terms for Biological Process
Figure 2 visualizes the structure of the modules identified by the GSMCA method. The yellow nodes form the core and the red nodes represent the attachments.
Figure 2. An example of modules identified by the GSMCA method.
The GSMCA method uses the parameter cn_{in}, and the effects of changing cn_{in} on cluster generation are shown in Figure 3. When cn_{in} changes from 0.1 to 0.9, the size of the biggest cluster and the average size of clusters decrease, but the number of clusters increases. The size of the biggest overlapping cluster is the same as that of the biggest non-overlapping cluster, so Figure 3(a) shows only one line. In Figure 3(b), the total number of overlapping clusters is greater than that of non-overlapping clusters. In Figure 3(c), the average size of the overlapping clusters is bigger than that of the non-overlapping clusters. The effect of cn_{in} on f-measure is shown in Figure 3(d). Figure 3 indicates that the f-measure is relatively lower when cn_{in} > 0.5. This is because, when cn_{in} is close to 1, the core of a cluster is almost a clique, which may be too strict to match well with the known annotations. The f-measure is basically stable when cn_{in} ≤ 0.5. Therefore, cn_{in} is set to 0.5.
Figure 3. The effects of cn_{in} on clustering. (a) The size of the biggest cluster (b) The total number of the clusters whose size is greater than 2 (c) The average size of the clusters whose size is greater than 2 (d) The average f-measure.
Experimental results for the GSMFC method
Table 7 compares the running time of the GSMFC method with that of other functional module identification algorithms. These algorithms are applied to the full DIP yeast dataset, which consists of 17201 interactions among 4930 proteins. Table 7 shows that the running time of the GSMFC method is the shortest, since it visits all edges only once. Since COACH detects cores from each vertex in the network once, the running time of COACH is also small, but it is greater than that of GSMFC. CFinder uses an efficient method called CPM to detect maximum cliques, so it is not time-consuming. DPClus requires a large amount of sorting and computation, so it is computationally costly.
Table 7. Comparison of the running time of the GSMFC algorithm and other algorithms
Table 5 compares the biological significance of modules predicted by several algorithms. The improved average f-measure and -log(p-value) demonstrate that the modules identified by the GSMFC method have higher statistical significance than those of the other methods. The GSMFC method generates fewer clusters with a larger average cluster size compared to the GSMCA method. Both Table 5 and Table 7 show that the GSMFC method can reduce the computational time noticeably while keeping high prediction accuracy compared to GSMCA. Furthermore, the GSMFC method, which does not require any input parameters, can be applied to even larger protein interaction networks.
Conclusions
Identification of functional modules is crucial to the understanding of the structural and functional properties of protein interaction networks. The increasing amount of protein interaction data has enabled us to detect protein functional modules. In this paper, a Greedy Search Method based on Core-Attachment structure called GSMCA is proposed to mine functional modules from protein interaction networks. Because core and peripheral proteins may have different roles and properties due to their different topological characteristics, the GSMCA method defines an edge weight and two criteria for determining core nodes and attachment nodes. It first generates the core of a module, and then forms the module by including attachments into the core. The GSMCA method is applied to the PPI network of S. cerevisiae. The MIPS benchmark and the GO annotation are used to validate the identified modules and to compare the performance of our algorithm with several other algorithms including MCODE, CFinder, DPClus, and COACH. The evaluation and analysis show that most of the functional modules predicted by our algorithm have high functional similarity and match well with the benchmark. The quantitative comparisons reveal that our algorithm outperforms the other competing algorithms. Many module detection approaches utilize traditional hierarchical clustering methods, which are computationally costly because the tree structure produced by the hierarchical clustering methods cannot provide adequate information to identify whether a network belongs to a module structure or not. To overcome these problems, the Greedy Search Method based on Fast Clustering (GSMFC) is proposed. The GSMFC method takes advantage of the greedy search procedure to separate the network into a suitable set of modules. The experimental results show that the GSMFC method can reduce the computational time significantly while keeping high prediction accuracy compared to GSMCA.
In future work, the algorithm needs to be extended to weighted graphs. How to incorporate diverse biological information into the explorative analysis of protein complexes in PPI networks is another interesting question for further research.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
JH supervised the work, and JH, BY and WZ contributed to the problem formulation and paper writing. JH and CL conducted research on the algorithms of GSMCA and GSMFC, and CL developed and implemented the algorithms. The manuscript was drafted by JH and CL. All authors read and approved the final manuscript.
Acknowledgements
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 10, 2012: "Selected articles from the 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11)". The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S10.
The authors would like to thank Bader G. and Hogue C. for sharing the MCODE tool, and Adamcsek B., Palla G., Farkas I., Derenyi I., and Vicsek T. for making CFinder publicly available. The authors are also grateful to Altaf-Ul-Amin Md., Shinbo Y., Mihara K., Kurokawa K., and Kanaya S. for kindly sharing the DPClus tool, and to Wu M., Li X., Kwoh C.K. and Ng S.K. for sharing the source code of COACH. The authors also thank the anonymous reviewers for their helpful and constructive suggestions.
This research is supported by the State Key Laboratory for Novel Software Technology of Nanjing University (KFKT2010B03) and the Open Research Foundation of the Key Laboratory for Computer Network and Information Integration, Southeast University (K939201019).