Abstract
Background
Recent advances in proteomic technologies have enabled us to create detailed proteinprotein interaction maps in multiple species and in both normal and diseased cells. As the size of the interaction dataset increases, powerful computational methods are required in order to effectively distil network models from largescale interactome data.
Results
We present an algorithm, miPALM (
 M
 I
 P
 L
 M
Conclusions
miPALM is a novel algorithm for detecting protein complexes from large proteinprotein interaction networks with improved accuracy than previous methods. The software is implemented in Matlab and is freely available at http://www.medicine.uiowa.edu/Labs/tan/software.html webcite.
Background
Protein complexes carry out the majority of biological processes within a cell. Correctly identifying protein complexes in an organism is useful for deciphering the molecular mechanisms underlying many cellular functions. Recent advances in proteomics technologies such as twohybrid system and mass spectrometry has allowed enormous amount of data on proteinprotein interactions (PPI) to be released into the public domain [1]. As the amount of global high throughput protein interaction data keeps increasing, methods for accurately identifying protein complexes from such data become a bottleneck for further analysis of the resulting interactome.
There is a large body of research on computational methods for de novo protein complex detection in PPI networks. These methods can be roughly divided into three categories. Methods in the first group define explicit complex criterion such as dense connectivity within a complex. A heuristic search strategy is then employed to identify complexes [24]. In contrast, the second group of methods also define a complex criterion but use complete enumeration to find all complexes that satisfy the criterion [57]. Instead of using local search strategy, the third group of methods are based on global graph partitioning techniques [811]. For instance, maximization of the modularity (Q) measure proposed by Newman and Girvan [12] has been successfully applied to PPI networks [11]. However, the global modularity measure has an inherent resolution limit for detecting small subnetworks [13], such as protein complexes whose median size is fewer than 10 proteins per complex. The reason for this resolution limit is that global modularity uses the entire network to compute the expected connectivity within a set of proteins, which may not be an appropriate measure of the background around protein complexes. Muff et al. [9] introduced a local version of the modularity measure (LQ) by only considering the immediate neighbors of a complex instead of the entire network. Applying it to the PPI network of E. coli, they showed that LQ was better at identifying small but biologically meaningful protein complexes.
Q and LQ represent two extremes of the neighborhood measure used to estimate background connectivity in a random network. Neither may be optimal for a given PPI network. In this study, we introduce a tunable parameter into the original formulation of modularity to help determine the optimal neighborhood size in calculating expected connectivity of a set of proteins. Another drawback of the previous LQ approach is that the computationally expensive optimization technique, simulated annealing, was used to maximize LQ, which is not feasible for large PPI networks such as yeast or human networks although it was proven useful for the smaller E. coli PPI network.
In this paper we introduce a novel algorithm to infer protein complexes by combining a parametric local modularity measure and a greedy search strategy. We evaluate our approach on the yeast PPI networks using two reference sets of protein complexes and additional functional annotations of yeast proteins. Compared to four existing methods, our algorithm achieves a significantly performance improvement in terms of Fmeasure and biological relevance of predicted complexes. By applying our method to two largescale PPI networks, we predict a set of 138 novel protein complexes in the baker's yeast S. cerevisiae that warrant future experimental characterization.
Results
Local Modularity with Coarseness Parameter Improves Complex Prediction
Previously, global (Q) [11] and local modularity (LQ) [9] have been proposed as a measure to detect protein complexes in large PPI networks. However, both measures have their drawbacks. The global modularity measure has an inherent resolution limit for small subnetworks such as protein complexes [13]. The local modularity measure only considers first neighbors of a subnetwork, which might not provide enough information for estimating the true background connectivity pattern of a random network. In this paper, we propose a new local modularity measure, LQa (parametric local modularity with the coarseness parameter α) for inferring protein complexes in large PPI networks. To compare how effective the three measures are to detect protein complexes, we first implemented three complex detection algorithms using a common greedy search strategy and each of the three modularity measures as the scoring function. We used the yeast full PPI network from the DIP database [14] and two sets of gold standard protein complex annotations (see Methods). As shown in Figure 1, LQa performed the best in terms of Fmeasure when evaluated using both gold standard sets. Note that Q and LQ have no coarseness parameter to set and the sets of predicted complexes are the same for the two sets of known annotations. For LQa we set the coarseness parameter to yield the best Fmeasure for each set of known complexes.
Figure 1. Performance comparison of three modularity measures. The yeast DIP full network was used as input. Optimal coarseness parameter, a, was optimized on three known complex sets separately.
The number and average size of the predicted complexes are listed in Table 1. As expected, Q found a very small number of complexes with a large number of members, which caused a low recall rate and Fmeasure. LQ further resolved those large subnetworks into a number of smaller ones. However, the average size of the predicted complexes (37.5) was still much larger than the average size of known complexes (< 10). In contrast, LQa found a reasonable number of complexes in the same size range as the known complexes.
Table 1. Number and average size (arithmetic mean, in parenthesis) of predicted complexes using three different modularity measures and the DIP PPI network as input.
Putting All Together: the miPALM Algorithm
We introduce a novel algorithm, miPALM (module inference by Parametric Local Modularity), for inferring protein complexes from largescale protein interactome data. The input to miPALM consists of an unweighted PPI graph and two parameters, α and δ. The algorithm has three major steps. Algorithmic details of each step and the corresponding pseudocode are described in the Methods section. We briefly describe the major steps of the algorithm here. First, from the input PPI network, miPALM identifies a set of triangle seeds using topological overlap measure. A pair of nodes in a network has high topological overlap if they are both strongly connected to the same group of nodes (see Methods). Therefore, the use of topological overlap measure serves to exclude spurious or isolated connections in the network. Second, from each seed, the algorithm uses a greedy search to expand it into candidate complex(es). Local modularity is used as a scoring function to assess the quality of a candidate complex. The parameter α is used to control the background neighborhood size around a candidate complex. Finally, a filtering step is performed on the set of candidate complexes based on their density scores which is controlled by the parameter δ. The complete algorithm for complex prediction is shown in Algorithm 4.
Performance Comparison with Existing Methods
Next, we compare the performance of our algorithm with four representative algorithms for protein complex prediction, MCODE [2], MCL [10], COACH [15], and DME [7]. MCODE relies on the concept of Kcore (a subgraph in which all nodes have a degree at least k) and greedy search. MCL is a global graph partitioning algorithm that works by simulating stochastic flows in a graph. COACH is conceptually similar to MCODE. It first identifies the core of a candidate complex (maximal set of connected vertices whose degrees are greater than the network average) and then expand the core by including additional nodes if more than 50% of their edges are shared with the core. DME detects all node subsets that satisfy a userdefined minimum density threshold in a greedy fashion. Of the five algorithms, MCL cannot detect overlapping complexes whereas MCODE, COACH, DME, and miPALM can. Additionally, MCL is a global graph partitioning method whereas the other four are based on seeding and local search.
We tested the performance of all five methods using two sets of known complexes in the baker's yeast, S. cerevisiae. CYC08 is a set of protein complexes manually curated from published smallscale studies [16]. Since most smallscale studies tend to be biased towards complexes involved in a limited number of cellular processes, to complement this set, we also used the YHTP08 set of protein complexes [16]. It was constructed by analyzing two recent and most comprehensive genomewide protein complex screens based on affinity purification coupled with mass spectrometry experiment [17,18].
For performance comparison we determined the optimal parameters for each algorithm to achieve the highest Fmeasure, given a gold standard set (see Methods). The comparison results are presented in Figures 2 and 3 and Table 2. For each method, we report the precision, recall, and Fmeasure. As can be seen in Figure 2A, both COACH and miPALM achieved a much higher Fmeasure compared to the other three methods. The average Fmeasure was 0.42, 0.39, 0.23, 0.16, and 0.12 for COACH, miPALM, MCL, MCODE, and DME, respectively.
Figure 2. Performance comparison of five complex detection algorithms. A) Fmeasure of the five algorithms with their best parameters optimized to two sets of known complexes CYC08 and YHTP08 using the DIP network as input. B) Precision and recall of the methods. Circles, Fmeasures measured against CYC08 (C) or YHTP08 (Y). Lines, Fmeasure contours. Two different points on the precisionrecall plane can have the same Fmeasure values if they lie on the same Fmeasure isoline. The average Fmeasure of an algorithm, over the two gold standard sets, is indicated with a star.
Figure 3. Biological relevance of predicted complexes by five algorithms with optimized parameters and the DIP PPI network as input. A) Percentage of complexes enriched for at least one GO term. B) Percentage of complexes whose members are colocalized in the same subcellular compartment.
Table 2. Statistics of predicted complexes by five algorithms with the best parameters optimized on CYC08 and YHTP08 sets and the DIP PPI network as input.
Figure 2B shows a breakdown of the Fmeasure into precision and recall for all five methods. On average, MCL achieved the highest recall mainly due to its large number of predictions. On the other hand, MCODE achieved the highest precision because it tends to identify a subset of known complexes with higher overlap than other methods. However, the overall accuracy of both methods (as measured by the Fmeasure) was lower than those of COACH and miPALM because MCL had a much lower precision and MCODE had a much lower recall. In other words, the higher Fmeasure achieved by COACH and miPALM is due to a balanced increase in both their recall and precision.
Although Fmeasure is a popular metric for evaluating the performance of a complex predictor, it is not the only one. Biological relevance is also an important indicator of the quality of predicted complexes. Accordingly, we next conducted GO term enrichment and colocalization analyses to determine the biological relevance of the predicted complexes. Genomewide protein localization data has been reported for Baker's yeast using fluorescent imaging [19]. For each predicted complex, we calculated a logodds score that measures the extent to which members of the complex colocalize to the same subcellular compartments (see Methods). Compared to the Fmeasure that relies on an incomplete gold standard set, both GO term and colocalization annotations used here are more comprehensive and thus complementary to the Fmeasure.
At a pvalue of 0.05, our set of predictions had the highest fractions of complexes with enriched functional categories (Figure 3A). Compared to the second best performer (MCODE), the average increase in the fraction of enriched complexes was 8.9% across the two gold standard sets of complexes. For complex member colocalization, our predictions had an 18.8% average increase compared to the second best performer, DME (Figure 3B).
Taken together, our benchmarking analyses demonstrated that miPALM achieved the second highest Fmeasure (3% lower than COACH) when evaluated using known complexes. On the other hand, miPALM outperforms all other algorithms by a large margin (8.9% and 18.8%) when evaluated using functional annotations of complex members.
Novel Complex Predictions Using Large Yeast PPI Networks
Next, we applied miPALM to discover novel protein complexes in two largescale yeast PPI networks based on interactions obtained from the BioGRID database [20]. The first network consists of all yeast interactions in the BioGRID database. The majority of interactions are derived from high throughput experiments. The second network consists of highconfidence interactions derived by filtering the BioGRID interactions based on their lines of supporting evidence [21]. For brevity's sake, these two networks are termed BioGRID and HC networks in this paper. The BioGRID network contains 5591 proteins and 51880 physical interactions and the HC network contains 2228 proteins and 6209 physical interactions. By studying two networks with different amount of noise, we can assess the robustness of our method on noisy data.
To predict complexes, we set the coarseness parameter α to be 0.364 that gave the highest Fmeasure as described in the performance comparison section.
In total, miPALM predicted 168 and 208 protein complexes from the BioGRID and HC network, respectively. The respective Fmeasures for the two sets of predictions are 0.31 and 0.52 (Figure 4A). As expected, predictions using HC network has a higher Fmeasure due to the higher quality of the input data. Nevertheless, as shown in Figure 5, the two sets of complexes overlap by 33.3% (56/168). To assess the significance of the overlap, we also used the other four methods in the benchmarking study to predict complexes in the BioGRID and HC networks. We used the same optimized parameters for each method as described in the performance comparison section. The two sets of complexes predicted by COACH had the highest overlap of 43.3%. The average overlap for the four methods was 26.6%. As an additional check, we considered miPALM predictions using the DIP networks as input. The average overlap between the three sets of predictions is 38.3% (Figures S6, S7 in Additional file 1). Taken together, the high level of overlap between miPALM predictions suggests that it is fairly robust against noisy data.
Figure 4. Fmeasure and biological relevance of predicted complexes using two largescale PPI networks as inputs. A) Fmeasures. Circles, Fmeasures measured against CYC08 (C) or YHTP08 (Y) using the BioGRID network as input. Squares, Fmeasures measured against CYC08 (C) or YHTP08 (Y) using the HC network as input. B) Fraction of predicted complexes enriched for GO term and colocalization.
Figure 5. Overlap between miPALM predicted complexes using two largescale PPI networks as inputs. A) Overlap between protein members in the predicted complexes. B) Overlap between complexes. C) Overlap between novel complexes. Blue, predicted complexes using HC PPI network; Red, predicted complexes using BioGRID PPI network.
Additional file 1. Supplemental Methods and Materials.
Format: PDF Size: 493KB Download file
This file can be viewed with: Adobe Acrobat Reader
After merging overlapped complexes, we ended up with 322 predicted complexes from the two networks. Two hundred thirty two of these complexes (72.5%) are enriched for at least one GO term (Table 3), suggesting many of them are true protein complexes. Examined separately, 109 (64.9%) BioGRID and 173 (83.2%) HC predictions are enriched for at least one GO term, respectively (Figure 4B).
Table 3. Supporting evidence for novel complexes predicted by miPALM compared to gold standard sets of known complexes.
To further corroborate our predictions, we next used a genomewide protein localization data set to examine if members of our predicted complexes tend to colocalize in the same subcellular compartments. For each of our predicted complex, we calculated a colocalization logodds score that compares the member colocalization probability of a predicted complex to the probability of the same number of random proteins in the PPI network (See Methods). For the set of 320 predicted complexes, 208 (65.0%) are enriched for at least one subcellular compartments (Table 3). Examined separately, 115 (68.5%) BioGRID and 123 (62.0%) HC predictions are enriched for at least one subcellular compartment, respectively (Figure 4B).
To identify new complexes in our prediction, we used the union of CYC08 and YHTP08 as the set of known complexes. After filtering those complexes matching any of the known complexes, we were left with 138 novel protein complexes. To evaluate the quality of these novel protein complexes, we computed the fraction of complexes that have enriched GO functional terms or are colocalized to the same subcellular compartments. Eight five (61.6%) of the novel complexes were enriched for at least one GO terms and 95 (68.8%) complexes were enriched for at least one subcellular compartments (Table 3). The fraction of GO term enriched complexes was comparable to known complexes. Remarkably, the fraction of colocalized complexes in our prediction was much higher than those of the two gold standard sets (Table 3). These results provide further evidence that the set of novel complexes are true protein complexes. Information about the complete set of predicted complexes with supporting evidence is reported in Additional files 1, 2, 3 and 4.
Additional file 2. Supplemental Table S1.
Format: XLS Size: 62KB Download file
This file can be viewed with: Microsoft Excel Viewer
Additional file 3. Supplemental Table S2.
Format: XLS Size: 71KB Download file
This file can be viewed with: Microsoft Excel Viewer
Additional file 4. Supplemental Table S3.
Format: XLS Size: 38KB Download file
This file can be viewed with: Microsoft Excel Viewer
Discussion
The global modularity measure proposed by Newman and Girvan [12] identifies clusters (subnetworks) in a network by comparing the observed fraction of edges inside a cluster to the expected fraction of edges in the cluster. In doing so, it assumes that connections between all pairs of nodes in the network are equally probable, which reflects all connectivity among all clusters. However, in many molecular interaction networks, most subnetworks are only connected locally. For instance, in metabolic networks, major pathways occur as clusters that are sparsely linked among each other [22]. The same observation can also be made on protein complexes [23].
In this study, we introduced parametric local modularity as a new measure for the quality of clusters in a network. It takes into account local cluster connectivity and overcomes global network dependency. As an analogy, the coarseness parameter functions as the resolution dial of a microscope. By changing the value of the coarseness parameter, we can adjust the size of the cluster neighborhoods when calculating the expected fraction of edges within a cluster. Since different biological networks might have distinct neighborhood connectivity, a tunable local modularity measure allow us to best estimate the local neighborhood connectivity by changing the size of the neigbhorhood under consideration.
Protein complexes are dynamic molecular entities. Depending on the cellular states, membership of a protein complex could change and different complexes could have shared members [18]. Our algorithm can detect overlapping complexes if during the seed expansion step seeds of different candidate complexes are close enough.
The Fmeasure used for performance evaluation is a popular approach. A drawback of Fmeasure is that it cannot distinguish whether a predicted complex overlap with just one or multiple known complexes and vice versa. It has been argued that predictions that overlap with fewer known complexes should be regarded as having a higher quality [24]. To further evaluate the methods using this criterion, we use the separation metric introduced by Brohee and van Helden [24] which takes into account the observation above. As shown in Figure S8 (Additional file 1), miPALM again outperforms the other methods. Therefore, it is unlikely that the performance improvement by miPALM is due to a bias in the benchmarking metrics used.
In summary, using three alternative performance measures (Fmeasure, Biological Relevance, Separation), our benchmarking analysis demonstrate that miPALM achieve an overal best performance among the five algorithms compared. The performance measures of the methods using three input interaction networks are summarized in Additional file 1, Tables S4, S5, S6.
The proposed algorithm can be naturally extended to handle weighted networks by using edge weights for local modularity calculation. Edge weights can be calculated based on topological features of the PPI network and domainspecific information from other omic data, such as microarray gene expression, genomewide association study, and genomewide sequence mutation data (e.g. cancer mutation screening). Integration of functional genomic data into miPALM will enable us to find contextdependent subnetworks that are active under specific growth conditions.
Conclusions
Using several performance measures (Fmeasure, Biological Relevance, and Separation), we have demonstrated that miPALM achieved an overall improvement over previous algorithms. miPALM combines the strength of three key features, triangle seed identification using topological overlap measure, parametric local modularity as a cluster quality measure, and recursive greedy search. By including functional genomic data as edge weights, miPALM can be extended to identify contextdependent gene modules that can in turn be used to assist in network comparison and classification tasks.
Methods
Protein interaction and complex data
Protein interaction networks
Yeast proteinprotein interaction data were downloaded from the DIP [14] and BioGRID [20] databases. The DIP "full" set of PPIs (including all physical interactions in the DIP database instead of a subset of high confidence interactions) were used for algorithm development and comparison. The BioGRID and highconfidence [21] sets of PPIs were used for novel protein complex prediction. After removing selfloops and multiple edges, the three networks contain 4859, 5591, and 2228 proteins and 17138, 51880, and 6209 interactions, respectively.
Known annotated protein complexes
Two sets of annotated protein complexes were used for performance evaluation. Pu et al. generated a comprehensive catalogue of 408 protein complexes manually curated from published smallscale experiments reported as of 2008 [16]. This set provides an update of the widely used goldstandard MIPS complexes. In the same study, they also generated a catalogue of 400 highthroughput complexes by a systematic analysis of all high throughput proteinprotein interaction data reported as of 2008. After removing complexes with fewer than 3 members, we ended up with two reference sets of protein complexes, termed CYC08 (236 complexes) and YHTP08 (207 complexes), respectively.
Construction of the seed set
Seeding strategy is crucial for a network searching algorithm since the search result is dependent on the starting point (e.g. a node, an edge, or a subnetwork). Here we describe how to construct seeds and to rank them based on the local property of the network.
First, we weight every interaction in the PPI network. For discovering good seeds, it is important to rank withincomplex edges high and betweencomplex edges low. We used a modified version of the topological overlap measure by Ravasz et al. [25] as edge weight. It is defined as following:
where Γ(v, w) is the number of common neighbors of node v and w, k_{v }and k_{w }are the degrees of node v and w, A_{vw }= 1 if v and w have a direct link and zero otherwise.
In the original definition of O_{T }(v, w), the number of shared interacting partners is normalized by dividing Γ(v, w) by min(k_{v}, k_{w}) instead of (k_{v }+ k_{w})/2. We modified the normalization factor because it is improper to treat two proteins topologically equal if one protein has three interactors and the other has 100 interactors (e.g. hub proteins) even though these two proteins share the same three interacting partners.
Second, we enumerated all triangles in the PPI network using the enumeration algorithm described in Algorithm 1. All triangles in the PPI network can be located by Algorithm 1 in O(k_{max}·m) time with an upper bound of O(n·m), where k_{max }is the largest node degree in the network.
Algorithm 1: TriangleEnumeration (G)
1 input: Unweighted graph G = (V, E)
2 output: all triangles of G
3 begin
4 for e ∈ E do
5 (v, w) ← a pair of nodes connected by e
6 Γ (v, w) ← a set of common nodes shared by v and w
7 for x ∈ Γ(v, w) do
8 output triplet {v, w, x}
9 remove e from G
10 end
We then rank all triangles found by Algorithm 1 based on their triangleweights obtained by averaging pairwise edge weights.
Local modularity as the scoring function
The total modularity Q of a network with M modules is defined as following [12]:
where m is the total number of edges in the network, m_{ss }is the number of intramodule edges in module S, and d_{s }is the sum of the degrees of nodes in module S. Essentially, Q is the difference in the fraction of withinmodule edges between the observed network and a random configuration network model. This definition of modularity is global in the sense that the comparison of m_{ss}/m with (d_{s}/2m)^{2 }assumes equal probability of connection between any pair of nodes in the random network model.
During module search, when a node v and a subnetwork S are merged, the change in global modularity can be derived, as followings,
where Q_{v }and Q_{S }are the modularity of v and S, respectively and Q_{vS }is the modularity of the subnetwork created by merging v and S.
In order to overcome the resolution limit of the global modularity measure, Muff et al. proposed the local modularity measure LQ [9]
where m_{ss }is the number of edges within subnetwork S and m_{s }is the total number of edges in S and its first neighbours. LQ is based on the observation that in real world networks most subnetworks are only connected to a small fraction of the entire network.
Inspired by previous work, we introduce a new local modularity measure for a single subnetwork as defined below:
where the denominator of the second term in Eq. 4 is not fixed to 2m, but varied with a parameter α that we call the coarseness parameter.
After merging v and S the change in the newly defined local modularity is then:
Readers are referred to the Suppl. Methods (Additional file 1) for detailed derivation of ΔLQ_{α }from LQ_{α}
When α = 1, ΔLQα is equivalent to ΔQ in Eq. 3. Decreasing α leads to a smaller number of edges to be considered. For example, if α = 0.5, the ratio of considered edges to the total number of edges in the network (i.e. edgecoverage ratio, r= 2m^{α }/2m )) is m^{1/2}. Conversely, if we want to cover locally 50% of edges (r = 0.5), then α can be set to 1+log_{m}(0.5). As α goes down to zero, the size of the detected subnetwork becomes smaller and smaller because the expected fraction of withinmodule edges, the second term in Eq. 5, becomes larger. Suppl. Figure S1 (Additional file 1) shows the edgecoverage ratio and size of resultant detected subnetworks as a function of α.
Greedy search by maximizing local modularity measure
The problem of finding a network partition with maximum global modularity is known to be NPhard [26]. Thus, various heuristic approaches were proposed [2732]. In particular, greedy search [31,32] based on global modularity have been studied extensively due to its single peakness [33] and fast speed for analyzing very large networks.
Our scoring function (Eq. 5) made it possible to adopt a greedy search strategy to expand a given triangle seed to a larger subnetwork iteratively until the increase in local modularity becomes negative. Pseudo codes for our greedy search algorithm are shown in Algorithms 2 and 3. Briefly, starting with the top ranked triangle seed {x, y, z}, our greedy algorithm always merge the direct neighbor w of the seed that increases local modularity the most, growing the seed into a larger subnetwork S={w, x, y, z}. The algorithm outputs S if it has no additional neighbor merging of which leads to an increase in the local modularity. This searching process (or seed expansion) is then repeated with a new seed. The timeconsuming step of the greedy search algorithm is the calculation of ΔLQ_{α }after each merging. We avoid recalculating ΔLQ_{α}(v, S') for all neighbours of S', v∈N_{s' }by taking advantage of the recursive relationship for ΔLQ_{α }between before and after merging (see Suppl. Methods and Figure S3 for details, Additional file 1). The upper bound for the time complexity of our search algorithm is O(n_{s}·d_{s}) where n_{s }is the number of proteins in the subnetwork S and d_{s }is the sum of degrees of all nodes in the subnetwork S.
Algorithm 2: RecursiveGreedySearch (S, A, α)
1 input: triangle seed S, adjacency matrix A, and coarseness parameter α
2 output: Expanded subnetwork S^{' }and its neighbor nodes N_{s'}
3 begin
Ns ← neighbor nodes of S
5 ΔLQ_{α }(·, S) ← change in our local modularity for all v in N_{s}
6 if max((ΔLQ_{α }(·, S)) < 0 then
7 return S and N_{s}
8 [S', N_{s'}] ← GrowSeed(S, A, N_{s}, α, ΔLQ_{α }(·, S))
9 return S' and N_{s'}
10 end
Algorithm 3: GrowSeed (S, A, N_{s}, α, ΔLQ_{α }(·,S))
1 input: triangle seed S, adjacency matrix A, a set of neighbor nodes of S N_{s}, coarseness parameter α, change in local modularity ΔLQ_{α }(v, S) for all v in N_{s}
2 output: Expanded subnetwork S^{' }and its neighbor nodes N_{s'}
3 begin
5 N_{v* }← all neighbor nodes of v*
6 S' ← {S, v*}
10 if max (ΔLQ_{α }(·,S')) < 0 then
11 return S' and N_{s'}
13 end
Elimination of unpromising seeds
Unpromising seeds are those that cannot be expanded into larger subnetworks. In other words, they are triangles that have no neighbors that can cause positive change in local modularity if merged. We filtered out those triangles after seed expansion step to speed up the algorithm and reduce the number of false positives (see Figure S2 in Additional file 1).
Complex merging
Proteins in a PPI network could belong to one or more protein complexes simultaneously. This multiple membership of proteins should be uncovered by the clustering algorithm. Complexes found by our method can be overlapped if they are within the same densely connected region in the PPI network. While revealing overlapped complexes is important for understanding their dynamics, allowing algorithm to make overlapped predictions often produce an excessive number of complexes. For example, the algorithm DME [7] predicted 14,780 complexes (minimum density threshold 0.95) on the yeast DIP full set. The majority of them are overlapped, causing low precision and poor overall performance. In this paper we merged any two complexes S and T if they have an overlap score of greater than 0.5, which is defined as S ⋂ T/min(S, T).
Complex filtering by density score
After merging complexes produced by the seed expansion step, we rank the candidate complexes by their density score δ_{s }that is defined as the product of the connectivity and size of complex .
The miPALM algorithm
Our algorithm takes as input an unweighted PPI network Gn, m={V, E} with n nodes and m edges and outputs a set of predicted protein complexes, M. The pseudo code of the algorithm is shown in Algorithm 4.
Algorithm 4: miPALM (G, α, δ)
1 Input: Unweighted graph Gn, m ={V, E}, n=/V/, m=/E, coarseness parameter α, and density score threshold δ
3 Output: a set of subnetworks, M
4 begin
5 T ← TriangleEnumeration (G)
6 t ← choose the top ranked triadseed in T
7 T ← delete t from list T
8 while T is not empty do
9 S ← RecursiveGreedySearch (t, A, α)
10 t ← choose the top triadseed uncovered by the previous search
11 T ← delete t from list T
12 if the size of S is three then
13 continue
14 S ← refine S by looking around S
15 M ← {M, S}, output S
16 S ← merge subnetworks in S
17 for S ∈ M do
18 δ_{s }← get density score f S
19 if δ_{s }< δ then
20 delete S from M
21 end
Performance evaluation
We used the Fmeasure to evaluate the performance of complex prediction algorithms. Fmeasure is the harmonic mean of the two quantities, precision (Pre) and recall (Rec), 2 Pre Rec/(Pre + Rec). Precision is defined as the ratio of the number of matched subnetworks to the number of predicted subnetworks by each algorithm. Recall is the ratio of the number of matched subnetworks to the number of known complexes.
For comparison purpose, we used the complex matching criterion used in MCODE [2] to identify predicted complexes that overlap with gold standard complexes. A predicted subnetwork is considered matched to a known complex if it has a matching score of 0.2 or greater. Matching score is defined as ω = c^{2}/a·b, where a, b are the size of the subnetwork and the known complex, respectively, and c is the number of protein members overlapped between the prediction and the known complex. We also examine the precision and recall rates at different overlap scores (see Figure S9 in Additional file 1).
Parameter selection
Our algorithm has two parameters, α for determining the size of the local neighborhood of a candidate complex and δ for filtering candidate complexes based on their density score. For benchmarking purpose, we used the Fmeasure to determine the parameters yielding the best performance of the algorithm on three sets of known complex. Because the δ parameter is only used for postsearch filtering, we first searched for the optimal α value. We varied α from 0 to 1 with an initial step size of 0.01. Once the range of optimal α value was located, we further searched for the optimal parameter value using a finer step size of 0.001 (Figure S4 in Additional file 1). After an optimal α was found, we determined the optimal δ by searching from 0 to 3.5 with a step size of 0.01. To determine the sensitivity of the algorithm to parameter changes, we determined the overlaps between predicted complexes using two α values differed by 0.01. As can be seen in Figure S5 (Additional file 1), our algorithm is not overly sensitive to parameter changes.
For the other four programs we compared, we tested the following parameter ranges that gave optimal Fmeasure on the three sets of known complexes. For COACH, the affinity threshold was varied from 0 to 1 with a step size of 0.01. For MCL, the inflation parameter was varied from 1.2 to 5.0 with a step size of 0.01. For DME, the density threshold parameter was varied from 0.91 to 1.0 with a step size of 0.01. For MCODE, vertex weight percentage = 0.2, haircut = TRUE, and fluff = FALSE were used. These parameters of MCODE have been optimized to produce the best results by default.
Gene ontology term enrichment test
Yeast Gene Ontology (GO) slim terms were used to evaluate the biological relevance of predicted complexes. Pvalue for GO term enrichment was calculated using the hypergeometric distribution. A Bonferronicorrected pvalue of 0.05 is considered to be significant.
Colocalization analysis
Based on fluorescence imaging, Huh et al. [19] classified 75% of the yeast proteome into 22 distinct subcellular compartments. Protein localization data was downloaded from the yeast GFP fusion localization database http://yeastgfp.yeastgenome.org webcite. To compute a logodds score of complex subcellular localization, we compared the observed number of protein pairs within a subnetwork S that are colocalized to subcellular compartment k (m_{sk}) to the expected number of such pairs in a random network , defined as following,
and
where n_{sk }is the number of proteins localized in compartment k in subnetwork S and p_{s }is the connectivity for the subnetwork. We consider a complex to be localized to a compartment k if the logodds score .
Authors' contributions
JK and KT conceived and designed the study. JK performed the experiments. JK and KT analyzed the data. All authors have read and approved the final manuscript.
Acknowledgements
We thank the anonymous reviewers for their helpful comments. This work is supported by the American Cancer Society [7700431 to K.T.] and the Pharmaceutical Research and Manufacturers of America Foundation [to K.T.].
References

Beyer A, Bandyopadhyay S, Ideker T: Integrating physical and genetic maps: from genomes to interaction networks.
Nat Rev Genet 2007, 8:699710. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks.
BMC Bioinformatics 2003, 4:2. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Everett L, Wang LS, Hannenhalli S: Dense subgraph computation via stochastic search: application to detect transcriptional modules.
Bioinformatics 2006, 22:e117123. PubMed Abstract  Publisher Full Text

Adamcsek B, Palla G, Farkas IJ, Derenyi I, Vicsek T: CFinder: locating cliques and overlapping modules in biological networks.
Bioinformatics 2006, 22:10211023. PubMed Abstract  Publisher Full Text

Palla G, Derenyi I, Farkas I, Vicsek T: Uncovering the overlapping community structure of complex networks in nature and society.
Nature 2005, 435:814818. PubMed Abstract  Publisher Full Text

Spirin V, Mirny LA: Protein complexes and functional modules in molecular networks.
Proc Natl Acad Sci USA 2003, 100:1212312128. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Georgii E, Dietmann S, Uno T, Pagel P, Tsuda K: Enumeration of conditiondependent dense modules in protein interaction networks.
Bioinformatics 2009, 25:933940. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

PereiraLeal JB, Enright AJ, Ouzounis CA: Detection of functional modules from protein interaction networks.
Proteins 2004, 54:4957. PubMed Abstract  Publisher Full Text

Muff S, Rao F, Caflisch A: Local modularity measure for network clusterizations.
Phys Rev E 2005, 72:056107. Publisher Full Text

Girvan M, Newman ME: Community structure in social and biological networks.
Proc Natl Acad Sci USA 2002, 99:78217826. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Newman MEJ, Girvan M: Finding and evaluating community structure in networks.
Physical Review E 2004, 69:026113. Publisher Full Text

Fortunato S, Barthelemy M: Resolution limit in community detection.
Proc Natl Acad Sci USA 2007, 104:3641. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update.
Nucleic Acids Res 2004, 32:D449451. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wu M, Li X, Kwoh CK, Ng SK: A coreattachment based method to detect protein complexes in PPI networks.
BMC Bioinformatics 2009, 10:169. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Pu S, Wong J, Turner B, Cho E, Wodak SJ: Uptodate catalogues of yeast protein complexes.
Nucleic Acids Res 2009, 37:825831. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, et al.: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae.
Nature 2006, 440:637643. PubMed Abstract  Publisher Full Text

Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, et al.: Proteome survey reveals modularity of the yeast cell machinery.
Nature 2006, 440:631636. PubMed Abstract  Publisher Full Text

Huh WK, Falvo JV, Gerke LC, Carroll AS, Howson RW, Weissman JS, O'Shea EK: Global analysis of protein localization in budding yeast.
Nature 2003, 425:686691. PubMed Abstract  Publisher Full Text

Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets.
Nucleic Acids Res 2006, 34:D535539. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Batada NN, Reguly T, Breitkreutz A, Boucher L, Breitkreutz BJ, Hurst LD, Tyers M: Still stratus not altocumulus: further evidence against the date/party hub distinction.
PLoS Biol 2007, 5:e154. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Guimera R, Nunes Amaral LA: Functional cartography of complex metabolic networks.
Nature 2005, 433:895900. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization.
Nat Rev Genet 2004, 5:101113. PubMed Abstract  Publisher Full Text

Brohee S, van Helden J: Evaluation of clustering algorithms for proteinprotein interaction networks.
BMC Bioinformatics 2006, 7:488. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierarchical organization of modularity in metabolic networks.
Science 2002, 297:15511555. PubMed Abstract  Publisher Full Text

Brandes U, Delling D, Gaertler M, Gorke R, Hoefer M, Nikoloski Z, Wagner D: On modularity clustering.
Ieee Transactions on Knowledge and Data Engineering 2008, 20:172188. Publisher Full Text

Agarwal GaK D: Modularitymaximizing graph communities via mathematical programming.
The European Physical Journal BCondensed Matter and Complex Systems 2008, 66:409418. Publisher Full Text

Brandes U, Delling D, Gaertler M, Gorke R, Hoefer M, Nikoloski Z, Wagner D: On Finding Graph Clusterings with Maximum Modularity.
GraphTheoretic Concepts in Computer Science 2007, 121132. Publisher Full Text

Noack AaR R: Multilevel Algorithms for Modularity Clustering. In Experimental Algorithms. Volume 5526. Heidelberg: Springer; 2009::257268. Publisher Full Text

SalesPardo M, Guimera R, Moreira AA, Amaral LA: Extracting the hierarchical organization of complex systems.
Proc Natl Acad Sci USA 2007, 104:1522415229. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Newman MEJ: Fast algorithm for detecting community structure in networks.
Physical Review E 2004, 69:066133. Publisher Full Text

Schuetz P, Caflisch A: Multistep greedy algorithm identifies community structure in realworld and computergenerated networks.
Physical Review E 2008, 78:026112. Publisher Full Text

Clauset A, Newman MEJ, Moore C: Finding community structure in very large networks.
Physical Review E 2004, 70:066111. Publisher Full Text