Introduction
Biomolecules interact with each other in a complex modular manner to maintain normal cellular functionalities 12. Identifying recurrent functional modules may help better understand the functional organization of cells 3. Many existing network clustering algorithms for functional module identification focus on identifying "modules" within which nodes are densely connected 456. However, identifying modules using these existing computational approaches may have artificially introduced biases from their module definitions and corresponding optimization methods 2. Topologically defined modules may simply originate from the evolution process but do not necessarily correspond to functional units in cells 7. In addition to densely selfconnected modules, there are other topological structures in biological networks which capture important functional relationships among biomolecules. For example, transmembrane proteins, including receptors in many signal transduction pathways, have a special structure in which they rarely interact with each other but have a similar interaction patterns with the rest of the network 2. To identify functional modules with richer topological structures, blockmodel network clustering recently has been proposed for functional module identification in biological networks 2489.
Blockmodel module identification problem has been investigated for years 2710 and has recently been used for blockmodeling functional modules within biological networks 2. However, the resulting optimization problem is NP hard with highly nonlinear and nonconvex properties with many local optima in its objective function. This makes it computationally prohibitive to obtain the optimal modules, especially for largescale biological networks. Simulated Annealing (SA) algorithm has been used to solve the optimization problem 2. However, a slow cooling down procedure is required to guarantee the solution quality. Furthermore, its computational time escalates quadratically with the increasing number of modules to identify. Therefore, more efficient algorithms are needed for discovering finegrained functional modules in genomescale biological networks.
In this paper, an efficient optimization methodsubgradient with path generation(SGPG) is proposed to solve this difficult nonconvex combinatorial optimization problem. In order to achieve results close to global optima, SGPG combines the convex programming method, which uses subgradient (SG) to efficiently obtain the local optima, and a heuristic path generation (PG) strategy, which makes use of the obtained local optima to search for better solutions. We have applied our SGPG as well as SA for functional module identification in two largescale proteinprotein interaction (PPI) networks: Saccharomyces cerevisia (Sce) PPI network from the Database of Interacting Proteins (DIP) 11 and Homo sapien (Hsa) PPI network collected from the Human Protein Reference Database (HPRD version 9) 12. The results demonstrate that our new SGPG method achieves competitive performance numerically and biologically comparing to SA but with significantly reduced computation time. Furthermore, we have implemented SGPG and the Markov Clustering (MCL) algorithm 13 to find finegrained modules of these two PPI networks. The results reveal that SGPG can identify additional biologically meaningful modules that MCL may miss, which may provide us a better understanding of the functional organization of these PPI networks.
Blockmodel module identification
We first review the blockmodel module identification framework proposed in 2710. Any given network can be represented as a graph G
=
{
V
,
E
}
, where V
=
{
v
1
,
v
2
,
…
,
v
N
}
denotes the set of N network nodes in G
, and E
is the set of edges. The topology of the network G can be represented by an N × N adjacency matrix A, where each entry A_{ij }represents the interaction between nodes v_{i }and v_{j }. The blockmodel framework introduces the image graph M
=
{
U
,
I
}
to abstract the function roles of the nodes in the original network and to outline the primary interactions among functional modules. In the image graph M
,
U
=
{
u
1
,
…
,
u
q
}
represents the set of virtual module nodes in the module space and
I
preserves the interactions within U
. The topology of M
also can be represented by a q × q adjacency matrix B, where the entry B_{rs }denotes the interaction between modules u_{r }and u_{s}. For module identification, the mapping τ assigns N nodes in the original network G to q different modules in the image graph M .
Assuming that we know the image graph adjacency matrix B, the optimization criterion of blockmodel framework is to minimize the mismatch between A_{ij }and B_{rs }by identifying an optimal mapping τ with resepct to the error function 2:
E
(
τ
,
B
)
=
1
M
∑
i
≠
j
N
(
A
i
j

B
τ
i
τ
j
)
(
w
i
j

p
i
j
)
,
in which w_{ij }denotes the weight of the corresponding edge in G (in this paper w_{ij }= A_{ij}); M
=
∑
i
≠
j
N
w
i
j
is used to restrict E(τ, B) between 0 and 1; and p_{ij }denotes the penalty of mismatching for the corresponding absent edges, which can be determined by
p
i
j
=
Σ
k
≠
i
w
i
k
Σ
ι
≠
j
w
l
j
Σ
k
≠
l
w
k
l
10.
In (1), we note that E(τ, B) = 0 when the image graph preserves all the edges of the original network
(
B
τ
i
τ
j
=
A
i
j
)
; Otherwise, E(τ, B) >0, meaning that the image graph does not preserve all the edges in the original network. When the image graph B has a mismatch for the absent edge between nodes v_{i }and v_{j}, it introduces error p_{ij}; It contributes w_{ij } p_{ij }to the error function in (1) when it has a mismatch for the existing edge between nodes v_{i }and v_{j }. Because A_{ij }is constant, therefore, minimizing E(τ, B) is equivalent to maximizing
1
M
∑
i
≠
j
N
(
w
i
j

p
i
j
)
B
τ
i
τ
j
, which can be rewritten as
max
τ
,
B
1
2
M
∑
i
≠
j
N
(
w
i
j

p
i
j
)
(
2
B
τ
i
τ
j

1
)
by using the binary trick. With that, we can formulate the objective function as
Q
(
τ
,
B
)
=
1
2
M
∑
r
,
s
∑
i
≠
j
N
(
w
i
j

p
i
j
)
δ
τ
i
r
δ
τ
j
s
(
2
B
r
s

1
)
,
where
δ
τ
i
r
is the indicator function that takes value 1 when τ_{i }= r and 0 otherwise. In order to maximize Q(τ, B), B_{rs }should be set to "1" when its corresponding term
∑
i
≠
j
N
(
w
i
j

p
i
j
)
δ
τ
i
r
δ
τ
j
s
is larger than 0, and 0 otherwise. Hence, the optimal solutions of τ and B are naturally decomposed. The optimal image graph B is a byproduct of the optimal mapping τ , which maximizes the following objective function:
Q
*
(
τ
)
=
1
2
M
∑
r
,
s
q
∑
i
≠
j
N
(
w
i
j

p
i
j
)
δ
τ
i
r
δ
τ
j
s
⋅
The optimization problem (2) is NP hard 214. In 10, SA has been proposed to solve the optimization problem, which has the time complexity escalating with the increasing q. In our biological application, the search space for annealing parameters also increases with the increasing q. To find a large number q of functional modules in largescale networks, SA is a very timeconsuming algorithm.
Subgradient with path generation (SGPG)
We propose to speed up the blockmodel module identification problem by convex programming combined with a heuristic path generation method. The basic idea is first to use the fast subgradient (SG) convex programming method to obtain the local optima, then use path generation (PG) to search for better solutions to reach global optima. We note that PG is originally proposed in this paper as a new useful heuristic to combine with subgradient algorithms to efficiently solve the hard combinatorial optimization problem. The combination of SG (time complexity O(qN^{2})) and PG (time complexity O(q^{2}N^{2})) can dramatically reduce the computational time with competitive performance compared to SA method.
Subgradient convex programming (SG)
Blockmodel module identification in matrix form
We now reformulate the module identification problem (2) into a matrix form by introducing an assignment matrix S corresponding to the module mapping τ. To identify q nonoverlapping functional modules in G , the assignment matrix S is defined as an N × q matrix with each entry S_{ir }= 1 when v_{i }is assigned to the module u_{r }or equivalently, τ_{i }= r; and S_{ir }= 0 otherwise. In other words,
S
i
r
=
δ
τ
i
r
. Each column in S corresponds to an image graph module node in which all the assigned network nodes take the value "1". We further use W to denote the weight matrix with each entry as the corresponding edge weight w_{ij}, and P as the penalty matrix with each entry as the corresponding penalty p_{ij}. The objective function in (2) can be rewritten in the following equivalent matrix form:
Q
*
(
τ
)
=
f
(
S
)
=
S
T
(
W

P
)
S
L
1
,
in which the sum of each row in S has to be the unity and the columns of S are orthogonal to each other. In addition, if we assume that each node has to be assigned to one module, the assignment matrix S has to satisfy the normalization condition S1_{q }= 1_{N}, in which 1_{q }and 1_{N }denote the qdimensional and Ndimensional vectors of all ones. Hence, the optimal solution for the assignment matrix S lies in the space ϕ
=
{
S
∈
{
0
,
1
}
N
×
q
,
S
1
q
=
1
N
}
, we have the convex programming formulation:
min
S
:
F
(
S
)
:
=

S
T
(
W

P
)
S
L
1
s
.
t
.
S
∈
ϕ
.
Note that we have converted our maximization problem into a minimization problem for the convenience of introducing subgradient methods in convex programming 15. We denote Q = S^{T }(W  P) S with its entries
Q
r
s
=
S
r
T
(
W

P
)
S
s
, where S_{r }is the rth column of S. Again, with the optimal assignment matrix S, we can derive the topology of the image graph B: B_{rs }= 1 if Q_{rs }>0, and 0 otherwise.
Subgradient
The optimization problem (4) is a nonsmooth combinatorial optimization problem as the objective function involves the L_{1 }norm of the matrix Q. To solve this hard optimization problem, we first relax the binary constraints S
∈
{
0
,
1
}
N
×
q
in (4) by the continuous relaxation S
∈
[
0
,
1
]
N
×
q
and use γ to represent the relaxed constraint set, which is a convex hull after relaxation. To address the nonlinearity of the matrix L_{1 }norm objective function F
(
S
)
=

Q
L
1
with the relaxed linear constraints, we propose to use FrankWolfe algorithm 15 to iteratively solve the following optimization problem with a linear objective function from the approximation by the firstorder Taylor expansion:
min
S
:
F
(
S
t
)
+
<
∇
F
(
S
t
)
,
(
S

S
t
)
>
s
.
t
.
S
∈
γ
,
where S^{t }is the current solution, <, > is the inner product operator, and the new objective function is from the firstorder Taylor expansion. The problem (5) at each iteration is a linear programming problem to search for the local extreme point along the gradient ∇F(S^{t}) as in steepest descent. However, as previously stated, F(S^{t}) takes the matrix L_{1 }norm, which is nonsmooth, and therefore nondifferentiable. To address this last complexity, we apply subgradient methods 15 to replace ∇F(S^{t}) by a subgradient ∂F(S^{t}) instead 16:
Definition (Subgradient): A matrix ∂
F
∈
R
N
×
q
is a subgradient of a function F
:
R
N
×
q
→
R
at the matrix X
∈
R
N
×
q
if F
(
Z
)
≥
F
(
X
)
+
<
∂
F
,
(
Z

X
)
>
,
∀
Z
∈
R
N
×
q
.
In our case, the subgradient of the matrix L_{1 }norm can be presented by its dual normmatrix L_{∞ }norm, which is used to derive the subgradient ∂F(S^{t}). Similar to the derivation for the subgradient of the L_{1 }norm of vectors in L_{1 }regularization in 16, we show that the subgradient of the L_{1 }norm of any matrix
X
N
×
q
is
∂


X


L
1
=
Y
∈
R
N
×
q
;


Y


L
∞
≤
1
if
X
=
0
;
Y
∈
R
N
×
q
;


Y


L
∞
≤
1
a
n
d
<
Y
,
X
>
=


X


L
1
,
otherwise;
where 0 is a N × q matrix of all zeros. For our module identification problem, we have the following proposition derived from (6):
Proposition: The subgradient of the objective function of our relaxed optimization problem F (S) at the assignment S^{t }can be defined as: ∂
F
(
S
t
)
=
2
(
P

W
)
S
t
Q
̄
. In our implementation, we choose
Q
̄
r
s
=
α
Q
r
s
=
0
;
1
Q
r
s
>
0
;

1
Q
r
s
<
0
;
where α is a number between [1, 1].
Proof: From (6), there always exists a
Q
̄
satisfying
Q
̄
L
∞
≤
1
and
Q
L
1
=
<
Q
̄
,
Q
>
. As ∂
Q
L
1
=
∂
<
Q
̄
,
Q
>
and the subgradient of differentiable functions is equal to its gradient 16, we have ∂
F
(
S
t
)
=

∂
[


Q


L
1
]
=

∂
<
Q
̄
,
Q
>
=

∂
tr
(
Q
̄
T
S
t
T
(
W

P
)
S
t
)
=
2
(
P

W
)
S
t
Q
̄
when S^{t }is close to the local minima. QED. □
Convex programming algorithm
Using FrankWolfe algorithm with the derived subgradient, we now have a conditional subgradient method 16 to iteratively solve the relaxed optimization problem as shown in the pseudocode given in the following:.
Algorithm: Conditional Subgradient
Input: initial value S^{t}, t = 0.
Do:
(i) Compute the subgradient ∂F (S^{t}).
(ii) Solve the minimization problem:
S* = arg min_{S }: <∂F (S^{t}), S >s.t. S ∈ γ
(iii) Linear search for the step in the direction S*  S^{t }found in (ii), update S^{t}, t = t + 1.
Until: ΔF + ΔS^{t} <ζ
Output: S^{t}.
In this algorithm, step (ii) at each iteration can be solved using a generic linear programming solver in O((qN )^{3.5}). However, due to the special structure of the optimization problem, we instead solve it as a semilinear assignment problem. As the assignment matrix [∂F(S^{t})]_{N × q }is not a square matrix, the optimization in step (ii) can be efficiently solved by assigning node i to module r, which is the index of the largest entry in row i of subgrident ∂F(S^{t}), with the time complexity O(qN).
To derive the solution to the original problem (4) from the results of the relaxed problem by the conditional subgradient algorithm, we recover from the relaxed solution to a closest feasible solution by a simple rounding up strategy. Finally, we note that the presented conditional subgradient algorithm converges to a local stationary point of the combinatorial optimization problem (4) due to the nonconvex nature of the objective function (3) with the worst case complexity O(qN^{2}) 15. Hence, good initialization is critical to get high quality results. In our current implementation, we initialize S^{t }by a modified ExpectationMaximization (EM) algorithm presented in 8.
Path generation (PG)
In order to make use of the local optima found by the above fast subgradient method, we propose a novel path generation method for our combinatorial optimization problem. The path generation method aims to conserve the overlap between two local optima, and get improvement based on the overlap which contributes significantly to the objective function value. Our new path generation is inspired by the path relinking method which connects two combinatorial local optima and try to find better results along the linking path 14. However, our method does not relink two local optimal results but creates new paths by extracting potentially useful overlap between them.
The essential idea of the path generation method is to construct new results by preserving the overlap between modules from two local optima that contributes significantly to the objective function. Given two solutions x_{A }and x_{B }from SG as the new path generators, PG generates new results and explores the search space while maintaining the current productive overlap between x_{A }and x_{B}. Let N_{r}(x_{A}) denote the module u_{r }of x_{A }and N_{s}(x_{B}) the module u_{s }of x_{B}. The contribution S(r_{A}, s_{B}) from the overlap Over(r_{A}, s_{B}) = {dd = N_{r}(x_{A}) ∩ N_{s}(x_{B})} is defined as:
S
(
r
A
,
s
B
)
=
s
A
B
T
(
W

P
)
S
A
L
1
+
s
A
B
T
(
W

P
)
S
B
L
1
.
Here, s_{AB }is a binary vector, of which each entry is equal to 1 when the corresponding node is in both N_{r}(x_{A}) and N_{s}(x_{B}), and 0 otherwise. The value of S(r_{A}, s_{B }) is the shared contribution to the objective function Q* in (2) between N_{r}(x_{A}) and N_{s}(x_{B}) for two feasible solutions. S_{A }and S_{B }are assignment matrix of the two solutions. The most promising overlap between modules r and s are determined by
(
r
A
,
s
B
)
=
a
r
g
m
a
x
{
S
(
r
,
s
)
:
r
,
s
∈
{
1
,
…
,
q
}
}
.
The path generation based on (9) proceeds in the following manner: First, the most promising overlap Over(r_{A}, s_{B}) between modules r_{A }and s_{B }of the initiating solution x_{A }and the guiding solution x_{B }is identified by (9), then r_{A }is locally adjusted to become Over(r_{A}, s_{B}) by removing nodes. After the adjustment, a new solution x_{1 }is generated and C_{A }= {r_{A}} and C_{B }= {s_{B}}, where C_{A }and C_{B }denote the sets of used modules in both solutions, respectively. Local search is then applied to find the improved
x
1
*
. Then we preserve x1* and let
x
A
=
x
1
*
. The above procedure is repeated until no overlap exists or it reaches other relaxed termination conditions, for example, we can set N_{stop }= 5 meaning that there are no larger than five nodes in the overlap of the modules from two solutions. Finally, we obtain the best solution along the generated paths. The whole procedure is illustrated in the following pseudo code:
Algorithm: Path Generation Method
Input: x_{A}, x_{B}, x, x_{best}, N_{stop}, C_{A }= Ø, C_{B }= Ø, Over = +∞, Q_{best }= −∞
While(
Over > N_{stop}
)
(1) (r_{A}, s_{B}) = argmax{S(r, s): r, s ∈ {1, ..., q} } and find Over(r_{A}, s_{B });
(2) modify nodes from r_{A }in x to make N_{r }(x) = Over(r_{A}, s_{B}) and C_{A }= {r_{A}}, C_{B }= {s_{B}};
(3)
(
Q
x
*
,
x
*
)
= LocalSearch(x);
(4) If
(
Q
x
*
>
Q
b
e
s
t
)
(5) Q_{best }=
Q
x
*
and x_{best }= x*;
(6) EndIf
(7) x_{A }= x* and find the next Over set using (9);
EndWhile
Output: x_{best }and Q_{best}.
To illustrate how PG works, an example of the path generation procedure is shown in Figure 1. The module organization of the given network is shown in Figure 1A. Assume x_{A }= {{1, 2, 4}, {5, 6}, {3, 7}} with
Q
x
A
*
=
0
.
201
and x_{B }= {{1, 2}, {3, 4}, {5, 6, 7}} with
Q
x
B
*
=
0
.
238
. Starting with C_{A }= C_{B }= Ø, a path is generated. At the first step, the most productive overlap between module
r
A
=
u
1
A
in x_{A }and
s
B
=
u
1
B
in x_{B }is identified, and a new solution x_{1 }is obtained by modifying rA=u1A to be the same as Over
(
u
1
A
,
u
2
B
)
with Q_{1 }= 0.201. Update
C
A
=
{
u
1
A
}
and
C
B
=
{
u
1
B
}
. Local search further improves the solution to obtain x1* with
Q
1
*
=
0
.
374
and then set
x
A
=
x
1
*
. Next, module
r
A
=
u
2
A
and
s
B
=
u
2
B
have the overlap with the largest contribution to the objective function. By similarly modifying
N
2
(
x
A
)
=
O
v
e
r
(
u
2
A
,
u
2
B
)
, we then generate a path
x
2
*
with
Q
2
*
=
0
.
374
and update
C
A
=
{
u
1
A
,
u
2
A
}
and
C
B
=
{
u
1
B
,
u
2
B
}
. In the end, we make
N
3
(
x
A
)
=
O
v
e
r
(
u
3
A
,
u
3
B
)
and get the final path
x
3
*
with
Q
3
*
=
0
.
374
and update
C
A
=
{
u
1
A
,
u
2
A
,
u
3
A
}
and
C
B
=
{
u
1
B
,
u
2
B
,
u
3
B
}
. The PG algorithm is executed as in Figure 1 with the time complexity O(q^{2}N^{2}).
<p>Figure 1</p>An example of path generation: A. Network structure; B. Path generation procedure
An example of path generation: A. Network structure; B. Path generation procedure.
Experimental results
We have implemented our SGPG method to identify functional modules in two biological networks: Saccharomyces cerevisia PPI network from the Database of Interacting Proteins (DIP) 11 and Homo sapien PPI network from the Human Protein Reference Database(HPRD) 12. We first show the efficiency of SGPG comparing to the previous algorithms based on SA for functional module identification in two networks with q = 10, 50 and 100. We further evaluate the potential of SGPG to identify biologically meaningful modules by contrasting the differences of the identified finegrained modules (q = 500 for Homo sapien PPI network and q = 300 for Saccharomyces cerevisia PPI network) detected by MCL algorithm 13. We show that SGPG can unearth certain kinds of biologically meaningful modules that may not be detected by MCL.
Performance comparison between SA and SGPG
We first compare SA and SGPG for module identification in two PPI networks with relatively small q = 10, 50, and 100 as SA requires very slow cooling down procedures to guarantee the solution quality when q >100. The Homo sapien PPI network has a largest component of 9,270 nodes and 36,917 edges. The upper bound of the objective function value in (2)
Q
m
a
x
*
=
0
.
98
when we consider the original network itself as the image graph with q =9,270. We also have implemented our algorithm to the Saccharomyces cerevisia PPI network, which has a largest component of 4990 nodes and 21,911 edges with the upper bound
Q
m
a
x
*
=
0
.
97
when q =4,990.
The parameter settings for SA and SGPG are listed in Table 1. The starting temperature and cooling down procedure are two critical parameters that determine the performance of SA. In our implementation, the starting temperature is set high enough and the cooling down procedure is set slow enough to avoid freezing in metastable states. For SGPG, we set the number of local optima N_{set }to 10 and the terminal condition N_{stop }= 5 for the minimum requirement of the overlap set Over in path generation.
<p>Table 1</p>Parameter settings in SA and SGPG
Para.
C_{β}
T_{start}
T_{end}
T_{sweep}
T_{switch}
N_{set}
N_{stop}.
SA
0.99
40
0.001
100
20


SGPG





10
5
Table 2 shows the comparison of the fitting score Q* computed by (2) and the running time between SA and SGPG. Because there is no ground truth of the exact functional modules in both networks, we use the fitting score Q* as the criterion of the optimization performance. All the experiments are programmed in C++ and run on a MacPro Station with a 2.4 GHz CPU and 6 Gb RAM. From Table 2 we find that the quality of the solutions computed by SGPG is very competitive to the solutions of SA with the largest gap 3.9% in Q* when q = 100 for Homo sapien PPI network. Meanwhile, SGPG is significantly faster than SA. Table 2 also reveals that the computation time of SA grows quadratically with increasing q, however, the computation time of SGPG grows subquadratically with q since the time complexity of SG and PG are O(qN^{2}) and O(q^{2}N^{2}) respectively. This makes a big difference when we need to identify a large number of modules. For example, SA takes more than two months for detecting q = 300 modules for the Homo sapien PPI network, while SGPG only requires around two days to obtain the results.
<p>Table 2</p>Comparison of SA and SGPG on Homo sapien and Saccharomyces cerevisia PPI networks
PPI
Method
Q*(q=10)
Time(h)
Q*(q=50)
Time(h)
Q*(q=100)
Time(h)
Homo sapien
SA
0.5393
1.73
0.6530
45.07
0.7180
180.26
SGPR
0.5346
0.5
0.6452
1.95
0.6898
6.35
Saccharomyces cerevisia
SA
0.5692
1.35
0.6834
25.02
0.7544
102.65
SGPR
0.5690
0.3
0.6752
1.15
0.7292
3.34
To further demonstrate whether these two different blockmodel methods have the potential of discovering biologically meaningful modules, we perform Gene Onotolgy (GO) enrichment analysis for the modules identified by both methods using GoTermFinder 17. Figure 2 displays the comparison of the number of significantly enriched modules with different q detected by both SA and SGPG. From both figures, we find that SGPG achieves competitive performances on identifying GO enriched modules comparing to SA.
<p>Figure 2</p>Comparison between SA and SGPG for the number of identified modules of Homo sapien PPI network (A) and Saccharomyces cerevisia PPI network (B) that have significantly enriched GO terms below 1% after Bonferronicorrection in either biological process, molecular function, or cellular compartment
Comparison between SA and SGPG for the number of identified modules of Homo sapien PPI network (A) and Saccharomyces cerevisia PPI network (B) that have significantly enriched GO terms below 1% after Bonferronicorrection in either biological process, molecular function, or cellular compartment.
Comparison between SGPG and MCL
In order to verify the biological significance of the modules identified by blockmodel identification, we have implemented both SGPG and MCL to detect finegrained modules for both Saccharomyces cerevisia and Homo sapien PPI networks. Because SA will take months to obtain results with q >200, we only have applied SGPG in this section. By analyzing the identified modules detected by two methods, we have found that SGPG can discover a comparable number of GO enriched modules as MCL detects. More importantly, SGPG discovers additional biologically meaningful modules in which proteins are sparsely connected but have the same interaction patterns to the rest of the network.
Saccharomyces cerevisia PPI network
We have identified finegrained modules for the Saccharomyces cerevisia PPI network using SGPG and MCL. We set q = 300 for SGPG and the inflation parameter I = 1.5 for MCL, which identified 370 modules in total. Within these identified modules, 296 modules by SGPG and 307 modules by MCL have more than two nodes. From these, we have found 150 and 153 modules respectively with significantly enriched GO terms below 1% after Bonferronicorrection by GoTermFinder. SGPG performs competitively to MCL. But more importantly, we find that SGPG can detect sparsely connected modules with certain interaction patterns that MCL fails to detect.
In order to scrutinize the differences between the modules discovered by SGPG and MCL. We have annotated all modules by KOG categories 18. For each module, the most KOG category assigned to the proteins in the module is annotated to that module. Figure 3A displays the percentage of the KOG annotated modules. Obviously, the percentages of modules annotated to KOG U, K, J and T are different (difference is larger than 2.5%) for the results from two methods. Specifically, SGPG detects more modules with KOG U, K and T annotations. To further examine the functionalities of different KOG categories, we discover that proteins annotated to KOG U play roles in intracellular trafficking, secretion, and vesicular transport; proteins annotated to KOG K have functionalities in transcription; and proteins in KOG T behave signal transduction functionalities. In 2, the results have already illustrated that proteins annotated to KOG T and K have the sparsely connected modules structures. Blockmodel based SGPG has successfully discovered more such modules than MCL. For proteins in KOG J (functions in translation, ribosomal structure and biogenesis), they are supposed to have densely connected modular structures which MCL tends to detect.
<p>Figure 3</p>Percentage of different categories of modules detected by SGPG and MCL (annotated by KOG)
Percentage of different categories of modules detected by SGPG and MCL (annotated by KOG). A. KOG percentage of Saccharomyces cerevisia PPI network. B. KOG percentage of Homo sapien PPI network.
To verify that SGPG does detect sparsely connected modules, we investigate the network topology of the proteins of all the modules annotated to KOG U, K, J and T. We count the number of sparsely connected modules, which have the interaction density of module less than 3%, annotated to KOG U, K, J and T. The specific comparison is in Table 3. Table 3 illustrates the difference of network topology by the average module density and the average clustering coefficient among proteins detected by both SGPG and MCL. The average clustering coefficients of the induced network, which is extracted from the original Saccharomyces cerevisia PPI network based on the proteins of modules annotated to certain KOG categories, are calculated by the definition in 19. Larger average clustering coefficients indicate that proteins have densely connected modular structures 19. From the table, we find there is a trend that the average module density and the average clustering coefficient of the modules identified by MCL are larger than those detected by SGPG, which means MCL detects more densely connected modules while SGPG can identify sparsely connected modules.
<p>Table 3</p>Topological analysis of different KOG categories in Saccharomyces cerevisia PPI network
KOG ID
Method
proteins
sparse modules/modules
Avg. density
Avg. clustering coef.
U
SGPG
353
15/26
2.98%
0.0814
MCL
256
0/21
27.38%
0.2402
K
SGPG
359
6/24
6.68%
0.1352
MCL
361
0/19
26.35%0
0.1834
J
SGPG
579
9/24
9.16%
0.0678
MCL
358
0/25
37.90%
0.1429
T
SGPG
169
13/21
3.47%
0.0755
MCL
94
0/12
31.31%
0.0912
Figure 4 illustrates an induced subnetwork of sparsely connected modules discovered by SGPG from the Saccharomyces cerevisia PPI network. Only the interactions among the proteins in Figure 4 are displayed. As shown in Figure 4, modules A and B are both sparsely connected modules, in which there is no interaction among proteins. Module C is a topologically cohesive module. Modules A, B and C are all significantly enriched in GO terms related to KOG G (carbohydrate transport and metabolism), T (signal transduction mechanisms) and C (energy production and conversion) respectively. From the structure illustrated in Figure 4, we find that proteins in module B play roles in passing signal between proteins of hexokinase activity and nucleoside phosphate metabolism. Furthermore, we notice that the marked patterns I and II are two types of interaction patterns across these modules, which tend to be clustered into the same module when using MCL. Figure 4 clearly displays the advantage of SGPG, which is to identify modules by their interaction patterns and functional roles rather than their interaction density.
<p>Figure 4</p>A subnetwork with sparsely connected modules detected by SGPG
A subnetwork with sparsely connected modules detected by SGPG. Module A is enriched in hexokinase activity with pvalue 1.71e5. Module B is enriched in response to endogenous stimulus with pvalue 4.77e5. Module C is enriched in nucleoside phosphate metabolism with pvalue 3.43e6. Patterns I and II are two specific interaction patterns in the subnetwork.
Additionally, Table 4 illustrates three sparsely connected module examples including module B in Figure 4, which are all detected by SGPG but missed by MCL. These three modules are annotated to KOG U and T respectively, which cannot be detected by MCL no matter what inflation parameter we choose.
<p>Table 4</p>Sparse modules in U and T KOG categories for Saccharomyces cerevisia PPI network
KOG ID
Sparse module example
Enriched genes
Enriched GO Term
GO Level
pvalue
U
YDR179C, YNL287W, YDL216C YCR099C, YIL004C,YAL026C YLR268W, YLR093C, YPR163C YPR148C, YOL064C, YOL117W YGL084C, YLR031W, YIL076W YPL179W, YKL191W, YPL010W
YOL117W, YDR179C, YDL216C
protein deneddylation
[+8, 0]
2.01e5
T
YJL092W, YDR490C, YOR231W YJL005W, YPL074W, YPL083C YNL323W,YOL100W
YDR490C, YOL100W, YNL323W, YJL005W, YOR231W
signal transduction
[+3, 1]
6.09e5
T
YDR076W, YDL059C, YJL173C YPL164C, YER171W, YPL026C YCR079W, YPL150W, YHR169W YJR062C
YDL059C, YPL026C, YER171W, YPL164C, YJL173C, YDR076W
response to endogenous stimulus
[+2, 1]
4.77e5
Homo sapien PPI network
For the Homo sapien PPI network, we set SGPG to identify q = 500 modules with the same settings in Table 1. For MCL, we set its inflation parameter I = 1.5 and have found 450 modules. We have performed GO enrichment analysis for these identified modules with more than two nodes (478 from SGPG and 380 from MCL). Based on GoTermFinder, 269 modules from SGPG and 265 modules from MCL are significantly enriched with pvalues below 1% after Bonferronicorrection. SGPG has discovered a competitive number of GO enriched modules compared to MCL. We also note that the modules identified by SGPG are relatively smaller than those from MCL and these modules have more specific enriched functionalities and may provide more detailed information for future catalog of functional modules. More importantly, SGPG detects several modules with interesting functionalities that MCL has missed.
Following the same analysis method used in the previous section, we first annotate all the identified modules with KOG categories to scrutinize the differences between modules detected by SGPG and MCL. Figure 3B shows the percentages of the modules annotated to different KOG categories by both methods. Obviously, SGPG detects more modules annotated in KOG T and K categories, within which functional modules tend to have sparsely connected structures. However, MCL discovers more modules annotated in KOG U, within which functional modules tend to have a densely connected structure in the Homo sapien PPI network.
Table 5 further consolidates that the modules detected by SGPG can detect sparsely connected patterns that MCL may miss. The average density and average clustering coefficient both indicate that modules discovered by MCL have cohesive modular structure, while modules discovered by SGPG are sparsely connected.
<p>Table 5</p>Topological analysis of different KOG categories in Homo sapien PPI network
KOG ID
Method
proteins
sparse modules/modules
Avg. density
Avg. clustering coef.
T
SGPG
1970
59/126
4.91%
0.0822
MCL
2481
0/66
26.32%
0.1696
K
SGPG
878
27/59
3.15%
0.0779
MCL
916
0/37
30.41%0
0.1928
U
SGPG
592
3/24
4.95%
0.0448
MCL
517
0/33
31.42%
0.1359
Figure 5 illustrates an induced subnetwork discovered by SGPG from the Homo sapien PPI network. Only the interactions among the proteins in Figure 5 are exhibited. As shown in Figure 5, modules A, B and C are all sparsely connected modules, which have no interactions inside the modules. Proteins in module D only have a few connections. Modules A and B are annotated to KOG K (transcription). While module D is annotated to KOG T (signal transduction mechanisms). Module C is annotated to both KOG T and K. Module C contains proteins SMAD2 and SMAD3 which play an important role in tumor generation 20. From the module structure in Figure 5, we find that SMAD2 and SMAD3 have intimate relationships to proteins of DNA binding, cellular response and kinase activity, which is useful to help us to obtain a better understanding of their functionalities and influence on other proteins. Furthermore, the two interaction patterns preserved in the module structure are detected by SGPG, which is difficult for MCL to identify.
<p>Figure 5</p>A subnetwork with sparsely connected modules detected by SGPG
A subnetwork with sparsely connected modules detected by SGPG. Module A is enriched in sequencespecific DNA binding with pvalue 9.91e7. Module B is enriched in cellular response to calcium ion with pvalue 4.04e7. Module D is enriched in MAP kinase activity with pvalue 8.60e5. Patterns I and II are two specific interaction patterns in the subnetwork.
Table 6 lists three sparsely connected module examples detected by SGPG but missed by MCL. These three modules are annotated to KOG T and K respectively, which cannot be detected by MCL no matter what inflation parameter we choose.
<p>Table 6</p>Sparse modules in T and K KOG categories for Homo sapien PPI network
KOG ID
Sparse module example
Enriched genes
Enriched GO Term
GO Level
pvalue
T
NTRK1, NTRK3, NTRK2 VAV1, VAV3
NTRK1, NTRK2, NTRK3
neurotrophin receptor activity
[+3, 1]
2.95e9
T
PIK3R3, PIK3R2, PIK3R1
PIK3R3, PIK3R2, PIK3R1
phosphatidylinositol 3kinase complex
[+5, 1]
4.77e9
K
JUN, JUNB, JUND
SPIB
JUN, JUNB,
JUND
cellular response to
calcium ion
[+6, 1]
4.04e7