Department of Computer Science, Iowa State University, Ames, IA, 50011, USA
Department of Computer Science, Free University of Berlin, 14195 Berlin, Germany
Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
Current address: Luxembourg Centre for Systems Biology, University of Luxembourg, L-4362 Esch-sur-Alzette, Luxembourg
Abstract
Background
Biological networks provide fundamental insights into the functional characterization of genes and their products, the characterization of DNA-protein interactions, the identification of regulatory mechanisms, and other biological tasks. Because of their experimental and biological complexity, exploiting them computationally poses many algorithmic challenges.
Results
We introduce novel weighted quasi-biclique problems to identify functional modules in biological networks represented as bipartite graphs. In contrast to previous quasi-biclique problems, we incorporate biological interaction levels by using edge-weighted quasi-bicliques. We prove that our problems are NP-hard and describe IP formulations that compute exact solutions for moderately sized networks.
Conclusions
We verify the effectiveness of our IP solutions using both simulation and empirical data. The simulation shows high quasi-biclique recall rates, and the empirical data corroborate the ability of our weighted quasi-bicliques to extract features and recover missing interactions from biological networks.
Introduction
Cellular processes such as transcription, replication, metabolic catalysis, or the transport of substances are carried out by molecules that are associated in functional modules, and are often realized as physical interactions within protein complexes. These physical interactions form molecular networks. Analyzing these networks is a thriving field (e.g.
These computational problems typically result from incomplete and error-prone networks that largely obfuscate the reliable identification of modules
Unweighted quasi-biclique approaches have been used in the past to identify modularity in protein interaction networks represented as bipartite graphs that are spanned between different features of proteins, e.g. binding sites and domain content function
An example of a quasi-biclique. A quasi-biclique (darker nodes and solid edges) identified from a gene interaction network in one of our experiment sets, where the edge weights are interaction scores. The bipartite graph is unweighted if only the existence of edges is considered.
Unweighted quasi-biclique approaches are sensitive to the quantitative uncertainties intrinsic to molecular networks. An interaction is represented by an unweighted edge in the bipartite graph only if its level is above some user-specified threshold. Unweighted quasi-biclique approaches therefore disregard many valuable interactions that fall below the threshold and treat all interactions above the threshold as equal. Moreover, whether an interaction near the threshold is represented may depend on a seemingly insignificant measurement error. Consequently, many crucial modules may remain undetected by unweighted quasi-biclique approaches.
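The information lost by thresholding can be illustrated with a minimal sketch. The gene names, interaction scores, and the cutoff of 0.5 are invented for illustration and are not taken from the data sets analyzed in this work.

```python
# Hypothetical interaction scores between two gene sets (illustrative values only).
weights = {
    ("g1", "h1"): 0.95,
    ("g1", "h2"): 0.51,
    ("g2", "h1"): 0.49,
    ("g2", "h2"): 0.10,
}


def threshold_edges(weights, cutoff):
    """Return the unweighted edge set kept by a user-specified cutoff."""
    return {edge for edge, w in weights.items() if w >= cutoff}


unweighted = threshold_edges(weights, cutoff=0.5)

# Scores 0.95 and 0.51 become indistinguishable edges,
# while the score 0.49 is dropped entirely.
print(sorted(unweighted))
```

In this toy example, a difference of 0.02 between two scores decides whether an edge exists at all, whereas a difference of 0.44 between the two retained edges is ignored.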
Here we introduce novel weighted quasi-biclique problems by using bipartite graphs where edges are weighted by the level of the corresponding interactions, e.g., Figure
Related work
Maximal bicliques in biological networks are self-contained elements characterizing functional modules. In protein interaction networks they manifest as interactive protein complexes (e.g.,
Our contributions
Here we define a "weighted" version of
Results and discussion
Before analyzing our findings in biological networks, we first introduce formal definitions of weighted quasi-bicliques (WQB) and then discuss the results of applying WQB as a data mining tool.
Preliminaries
A
Maximum weighted quasi-biclique (α,β-WQB) problem
Definition 1 (
Definition 2 (
In either version, the
Definition 3 (Maximum
Problem 1 (
Note that, we use the same notation (
Query problem
A common requirement in the analysis of networks is to provide the environment of a certain group of genes, which translates into finding the maximum weighted
Problem 2 (
Experiment results
Finding appropriate values for
Simulations
As part of the simulation studies, we try to retrieve a known maximum weighted quasi-biclique from a weighted bipartite graph using both versions of
For retrieving
Similarly, for retrieving
The ILP models of the corresponding
For the evaluation, let (
Simulation results of
            16 × 16                          32 × 32                          40 × 40
            Method 1   Method 2   Constant   Method 1   Method 2   Constant   Method 1   Method 2   Constant
0.5   0.5   33 / 77    55 / 100   44 / 22    33 / 30    55 / 6     11 / 40    75 / 37    83 / 79    61 / 62
0.55  0.45  27 / 38    100 / 100  83 / 88    0 / 0      0 / 0      0 / 0      0 / 25     0 / 0      11 / 0
0.6   0.4   66 / 77    83 / 88    100 / 88   88 / 56    91 / 80    66 / 70    49 / 53    61 / 66    61 / 66
0.65  0.35  72 / 77    88 / 88    88 / 88    56 / 66    40 / 66    40 / 66    66 / 91    91 / 100   91 / 100
0.7   0.3   72 / 100   100 / 100  100 / 100  78 / 83    91 / 91    85 / 91    88 / 77    100 / 83   100 / 83
0.75  0.25  66 / 83    100 / 100  100 / 100  70 / 69    100 / 100  100 / 100  64 / 91    93 / 100   93 / 100
0.8   0.2   66 / 100   100 / 100  100 / 100  100 / 78   100 / 91   100 / 91   70 / 76    100 / 100  100 / 100
Recall of vertices in the simulation. For every experiment, the value in the
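A minimal sketch of a vertex recall computation follows, assuming the usual notion of recall: the percentage of vertices of the planted quasi-biclique that also appear in the reported one. The function name and the toy vertex sets are hypothetical, and the exact evaluation used in the simulation may differ.

```python
def vertex_recall(planted, found):
    """Percentage of planted vertices that also appear in the reported vertex set."""
    planted, found = set(planted), set(found)
    if not planted:
        raise ValueError("planted vertex set must be non-empty")
    return 100.0 * len(planted & found) / len(planted)


# Toy example: three of the four planted vertices are recovered.
planted_u = {"u1", "u2", "u3", "u4"}
found_u = {"u1", "u2", "u3", "u9"}
print(vertex_recall(planted_u, found_u))  # 75.0
```

Recall computed this way ignores extra vertices in the reported set, so it would normally be read separately for each side of the bipartite graph.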
Genetic interaction networks
A comprehensive set of genetic interaction and functional annotation published recently by Costanzo
Pairwise comparisons of the total 18 functional classes provide 153 sets. For every distinct pair (
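The count of 153 sets is the number of unordered pairs that can be drawn from 18 classes. The short check below uses placeholder class names, which are assumptions and not the actual functional annotations.

```python
from itertools import combinations
from math import comb

# Placeholder names standing in for the 18 functional classes.
classes = [f"class_{i}" for i in range(1, 19)]

# Every unordered pair of distinct classes yields one bipartite graph.
pairs = list(combinations(classes, 2))

print(len(pairs), comb(18, 2))  # 153 153
```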
The absolute values of the interaction score
Biological interpretation and examples
Genes with high degree and strong links dominate the results. In several instances, the quasi-bicliques are trivial in the sense that only one gene is present in
We observed the following with the maximum weighted
Maximum weighted
Recovering missing edges
In the published data sets, edges below various thresholds have been removed. To sample such missing edges, we calculate the average weight of all edges removed from the 153 bipartite graphs generated above; this average is 0.0522.
For each of the 153 maximum weighted quasi-bicliques of either version, the missing edges induced by the quasi-bicliques are then identified, and the average missing edge weight
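The identification of missing edges induced by a quasi-biclique can be sketched as follows. The sketch assumes that an unfiltered score table is available for looking up the weights of removed edges. All names, scores, and the cutoff of 0.08 are hypothetical.

```python
from itertools import product


def missing_edges(u_side, v_side, edges):
    """Pairs (u, v) spanned by the two vertex sets that are absent from the edge set."""
    return [
        (u, v)
        for u, v in product(sorted(u_side), sorted(v_side))
        if (u, v) not in edges
    ]


def average_weight(pairs, full_weights):
    """Mean weight of the given pairs, looked up in the unfiltered score table."""
    if not pairs:
        return None
    return sum(full_weights[p] for p in pairs) / len(pairs)


# Hypothetical unfiltered scores and an illustrative publication cutoff.
full_weights = {("a", "x"): 0.9, ("a", "y"): 0.04, ("b", "x"): 0.8, ("b", "y"): 0.7}
published = {edge for edge, w in full_weights.items() if w >= 0.08}

# Missing edges induced by the vertex sets {a, b} and {x, y}.
gaps = missing_edges({"a", "b"}, {"x", "y"}, published)
print(gaps, average_weight(gaps, full_weights))
```

Comparing such an average against the average weight of all removed edges indicates whether the quasi-biclique points at removed edges that are stronger than typical ones.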
We further compare
(1)
(2)
(3)
(4)
(5)
Comparing the averages of
Missing edge recovery in a genetic interaction network
WQB    d05/m1  d05/m2  ab/m2   a005/m2  a01/m2  a02/m2  a03/m2  a04/m2  a05/m2
avg(   0.0855  0.0844  0.0850  0.0806   0.0830  0.0867  0.0905  0.0934  0.1169
WQB




avg(




0.1008
0.0805
0.0809
0.0823
0.0825
A comparison of
Method
Time complexity
Here we prove the NP-hardness of the
Lemma 1.
We now prove that checking for the existence of a percentage
Problem 3 (Existence).
To prove the hardness of the existence problem, we need some auxiliary definitions. A
Definition 4 (Modified
Problem 4 (One-sided existence).
Problem 5 (Modified existence).
The series of reductions that proves the hardness of the existence problem is as follows. We first reduce the
Lemma 2.
Given a weighted bipartite graph
We are left to show that
a. Construction: Let
b. ⇒: Let (
⇐: Let (
Hence, the modified existence problem is NP-complete.
Lemma 3.
a. Construction: Let
b. ⇒ and ⇐: Let (
This proves that the one-sided existence problem is NP-complete.
Lemma 4. Existence
a. Construction: Let
b. ⇒ and ⇐: Any
IP formulations for the
Although greedy approaches are often used in problems of a similar structure, e.g., multidimensional knapsack
Due to the similarity in formulating constraints between
Quadratic programming
For each
The quadratic terms in the constraints are necessary because,
Converted linear programming
A standard approach to converting a quadratic program into a linear one is to introduce auxiliary variables that replace the quadratic terms. Here we introduce a binary variable
Expressions (7) and (8) state the condition that
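As a hedged illustration of replacing a quadratic term by an auxiliary variable, the sketch below checks by enumeration a common textbook linearization of the product of two binary variables. The names x, y, and z and the three inequalities are assumptions chosen for illustration and are not claimed to be identical to the constraints of the formulation described here.

```python
from itertools import product


def feasible(x, y, z):
    """Textbook linear constraints that force z to equal x * y for binary x, y, z."""
    return z <= x and z <= y and z >= x + y - 1


# For every binary assignment of x and y, exactly one value of z is feasible,
# and that value is the product x * y.
for x, y in product((0, 1), repeat=2):
    allowed = [z for z in (0, 1) if feasible(x, y, z)]
    assert allowed == [x * y]
    print(x, y, allowed)
```

When the objective or the remaining constraints already push the auxiliary variable in one direction, some of these inequalities can become redundant, which is the kind of observation that motivates tightened formulations.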
Improved linear programming
Observe that constraint (7) becomes trivial if
There is a variable
Recall that the difference between the two problems
As a result, the problem instance is a
If there are
Conclusions
We address noise and incompleteness in biological networks by introducing graph-theoretical optimization problems that identify variations of novel weighted quasi-bicliques. These quasi-biclique problems incorporate biological interaction levels in different analytical settings and exhibit improvements over unweighted quasi-bicliques. To meet the demands of biologists, we also provide a query version of the (weighted) quasi-biclique problems. We prove that our problems are NP-hard, and describe IP formulations that can tackle moderately sized problem instances. Simulations and empirical data solved by our IP formulations suggest that our weighted quasi-biclique problems are applicable to various other biological networks.
Future work will concentrate on the design of algorithms for solving large-scale instances of weighted quasi-biclique problems within guaranteed bounds. Greedy approaches may result in effective heuristics that can analyze ever-growing biological networks. A practical extension to the query problem is the development of an efficient enumeration of all maximal
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
WCC and SV were both responsible for developing the solution, carrying out experiments, and writing of the manuscript. RK performed the experimental evaluation, analysis, and contributed to the writing of the manuscript. OE supervised the project and contributed to the writing of the manuscript. All authors read and approved the final manuscript.
Acknowledgements
This article has been published as part of
We thank Heiko Schmidt for discussions that initiated the concept of weighted quasi-bicliques. Further, we thank Nick Pappas, John Wiedenhoeft and anonymous reviewers for valuable comments. WCC, SV, and OE were supported in part by NSF awards #0830012 and #1017189.