Abstract
Background
Biological networks provide fundamental insights into the functional characterization of genes and their products, the characterization of DNAprotein interactions, the identification of regulatory mechanisms, and other biological tasks. Due to the experimental and biological complexity, their computational exploitation faces many algorithmic challenges.
Results
We introduce novel weighted quasibiclique problems to identify functional modules in biological networks when represented by bipartite graphs. In difference to previous quasibiclique problems, we include biological interaction levels by using edgeweighted quasibicliques. While we prove that our problems are NPhard, we also describe IP formulations to compute exact solutions for moderately sized networks.
Conclusions
We verify the effectiveness of our IP solutions using both simulation and empirical data. The simulation shows high quasibiclique recall rates, and the empirical data corroborate the abilities of our weighted quasibicliques in extracting features and recovering missing interactions from biological networks.
Introduction
Cellular processes such as transcription, replication, metabolic catalyses, or the transport of substances are carried out by molecules that are associated in functional modules, and are often realized as physical interaction within protein complexes. These physical interactions form molecular networks. Analyzing these networks is a thriving field (e.g. [1]) and has extensive implications for a host of issues in biology, pharmacology [1], and medicine [2]. Capturing the modularities of molecular networks accurately will gain insights into cellular processes and gene function. Yet, before such modularities can be reliably inferred, challenging computational problems have to be overcome.
These computational problems typically result from incomplete and errorprone networks that largely obfuscate the reliable identification of modules [3,4]. Often, molecular interactions can not be measured to the accuracy of the genome sequences, leaving some guesswork in identifying modularities correctly. Some molecular interactions are highly transient and can only be measured indirectly, while others withstand denaturing agents. Functional interaction does not even have to be realized via physical interactions. Thus, computational methods for capturing modularity can not directly rely on presence or absence of interactions in molecular networks and need to be able to cope with substantial error rates.
Unweighted quasibiclique approaches have been used in the past to identify modularity in protein interaction networks when presented as bipartite graphs that are spanned between different features of proteins, e.g. binding sites and domain content function [3,5]. An example is depicted in Figure 1. While these approaches aim to solve NPhard problems using heuristics, they were able to identify some highly interactive protein complexes [6,7].
Figure 1. An example of a quasibiclique. A quasibiclique (darker nodes and solid edges) identified from a gene interaction network in one of our experiment sets where the edge weights are interaction scores. The bipartite graph is unweighted if only the existence of edges are considered.
Unweighted quasibiclique approaches are sensitive to the quantitative uncertainties intrinsic to molecular networks. Interactions are only represented by an unweighted edge in the bipartite graph if they are above some userspecified threshold. Therefore, unweighted quasibiclique approaches are prone to disregard many of the invaluable interactions that are below the threshold, and treat all interactions above the threshold the same. Further, some interactions may or may not be represented due to some seemingly insignificant error in the measurement. Consequently, many crucial modules may be concealed and remain undetected by using unweighted quasibiclique approaches.
Here we introduce novel weighted quasibiclique problems by using bipartite graphs where edges are weighted by the level of the corresponding interactions, e.g., Figure 1. We show that these problems are, similar to their unweighted versions, NPhard. However, in practice, exact Integer Programming (IP) formulations can efficiently tackle many NPhard realworld problems [8]. Therefore, we describe exact Integer Programming (IP) solutions for our weighted quasibiclique problems. Furthermore our IP solutions exploit the sparseness of molecular networks when represented as bipartite graphs. This allows to verify the ability of our IP solutions using a moderately sized genetic network [9], and simulation studies. In addition our IP solutions can provide exact results for instances of the unweighted quasibiclique approaches that were previously not available.
Related work
Maximal bicliques in biological networks are selfcontained elements characterizing functional modules. In protein interaction networks they manifest as interactive protein complexes (e.g., [3,7]). Bipartite graphs are graphs whose vertices can be bipartitioned into sets X and Y such that each edge is incident to vertices in X and Y. A biclique is a subgraph of a bipartite graph where every vertex in one partition is connected to every vertex in the other partition by an edge. A biclique is maximal if it is not properly contained in any other biclique, and it is maximum if no other maximal bicliques have larger total edge weights. The problem of finding maximum bicliques is well studied in the literature of graph theory and is known to be NPcomplete [10] and effective heuristics for this problem have been described and used in various applications [11]. However, bicliques are too stringent for identifying modules in real world networks [12]. For example a module is not identified through a biclique that is incomplete by one single edge. Quasibicliques are partially incomplete bicliques that overcome this limitation. They allow a specified maximum number of edges to be missed in order to form a biclique [13]. While quasibicliques are less stringent for the identification of modules, they might contain genes that are interacting with only a few or none other of the genes. Such situations occur when the missing edges are not homogeneously distributed throughout. The δquasibicliques (δQB) [14] allow to control the distribution of missing edges by setting lower bounds, parametrized by δ, on the minimum number of incident edges to vertices in each of the vertex sets in a δQB.
Our contributions
Here we define a "weighted" version of δQB, called α,βweighted quasibicliques (α,βWQB), to improve on the identification of modules in molecular networks by using the interaction levels between genes. Thus, α,βWQB's may be better applicable to handle noisy data sets as they distribute the overall missing information across the vertices of the quasibiclique as shown in our simulations. We define two versions of α,βWQB's, each in terms of the amount of weight the quasibiclique is allowed to miss. The two different versions of weighted quasibicliques provide flexibility in choosing the missing weight. For the first version, called the percentage version, we define the missing weight in terms of the percentage of number of vertices in the quasi biclique. While for the second version, called the constant version, the missing weight is defined as a constant. The need for constant version weighted quasibicliques arises from the fact that, for certain applications, the weight allowed to be missed does not depend on any of the graph parameters and is a constant. Finding a maximum α,βWQB in a given edge weighted bipartite graph is NPhard, since it is a generalization of the NPhard problem to find δQBs in unweighted bipartite graphs [3]. We also introduce a "query" version of the maximum α,βWQB problem that allows biologists to focus their analyzes on genes of their particular interest. Given a network and specific genes from this network, called query, the query problem is to find a maximum weighted α,βWQB that includes the query. We prove this problem to be NPhard. While the maximum α,βWQB problem and its query version are NPhard, we provide exact IP solutions to solve both problems. By reducing the number of required variables and exploiting the sparseness of bipartite graphs representing molecular networks, our solutions solve moderatesized instances. This allows us to verify the applicability of α,βWQB by analyzing the most complete data set of genetic interactions available for the Eukaryotic model organism Saccharomyces cerevisiae. Our results not only extract meaningful yet unexpected quasibicliques under functional classes, but also suggest higher possibilities of recovering missing interactions not presented in the input. A preliminary version of this work appeared in ISBRA 2011 [15]. In this paper, we extend the usage of the parameters α and β such that the edge weight threshold can be either a ratio of the α,βWQB size or a constant. The time complexity and the application of this extended α,βWQB are both discussed.
Results and discussion
Before analyzing our findings in biological networks, we first introduce formal definitions of weightedquasi bicliques (WQB) and then discuss the results of applying the WQB as a data mining tool.
Preliminaries
A bipartite graph, denoted by (U + V, E), is a graph whose vertex set can be partitioned into the sets U and V such that its edge set E consists only of edges {u, v} where u ∈ U and v ∈ V (U and V are independent sets). Let G := (U + V, E) be a bipartite graph. The graph G is called complete if for any two vertices u ∈ U and v ∈ V there is an edge {u, v} ∈ E. A biclique in G is a pair (U', V') that induces a complete bipartite subgraph in G, where U' ⊆ U and V' ⊆ V. Since any subgraph induced by a biclique is a complete bipartite graph, we use the two terms interchangeably. A pair (U, V) includes another pair (U', V') if U' ⊆ U and V' ⊆ V. In such case, we also say that the pair (U', V') is included in (U, V). A pair (U, V ) is nonempty if both U and V are nonempty. A weighted bipartite graph, denoted by (U + V, E, ω), is a complete bipartite graph (U + V, E) with a weight function ω : E → [0, 1].
Maximum weighted quasibiclique (α,βWQB) problem
Definition 1 (α,βWQB_{P })). Let G := (U + V, E, ω) and α,β ∈ [0, 1]. A percentage version α,βweighted quasibiclique, denoted as α,βWQB_{P}, in G is a nonempty pair (U', V') that is included in (U, V ) and satisfies the two properties:
(1) ∀u ∈ U' : ∑_{v∈V' }ω(u, v) ≥ α V', and (2) ∀v ∈ V' : ∑_{u∈U' }ω(u, v) ≥ βU'.
Definition 2 (α,βWQB_{C }). Let G := (U + V, E, ω) and α,β ∈ [0, ∞). A constant version α,βweighted quasibiclique, denoted as α,βWQB_{C}, in G is a nonempty pair (U', V') that is included in (U, V ) and satisfies the two properties:
(1) ∀u ∈ U' : ∑_{v∈V' }ω(u, v) ≥ V'  α, and (2) ∀v ∈ V' : ∑_{u∈U' }ω(u, v) ≥ U'  β.
In either version, the weight of an α,βWQB is defined as the sum of all its edge weights.
Definition 3 (Maximum α,βWQB_{P(C)}). A α,βWQB_{P(C)}, is a maximum α,βWQB_{P(C) }of a weighted bipartite graph G := (U + V, E, ω), if its weight is at least as much as the weight of any other α,βWQB_{P(C) }in G for given values of α and β
Problem 1 (α,βWQB_{P(C)}).
Instance: A weighted bipartite graph G := (U + V, E, ω), and values α, β ∈ [0, 1]([0, ∞)).
Find: A maximum weighted α,βWQB_{P(C) }in G.
Note that, we use the same notation (α,βWQB_{P(C)}) for α, βweighted quasibiclique and maximum weighted α, βweighted quasibiclique problem of either version. The context in which we use the notation will make the difference clear. Also, when we just say α,βWQB, the context will make clear if we are referring to percentage version or constant version or both.
Query problem
A common requirement in the analysis of networks is to provide the environment of a certain group of genes, which translates into finding the maximum weighted α,βWQB_{P(C) }which includes a specific set of vertices. We call this the query problem and is defined as follows.
Problem 2 (Query_{P(C)}).
Instance: A weighted bipartite graph G := (U + V, E, ω), values α, β ∈ [0, 1] ([0, ∞)), and a pair (P, Q) included in (U, V).
Find: The α,βWQB_{P(C) }which includes (P, Q) and has a weight greater than or equal to the weight of any α,βWQB_{P(C) }which includes (P, Q).
Experiment results
Finding appropriate values for α and β is a critical part in an application. The α and β values allow the user to customtailor the search based on the weight distribution and the expected findings of the particular application. Typically, quasibicliques of different α and β values have to be analyzed by the domain experts in order to optimize the findings. We use simulated data sets to explore the problem of finding right α and β values. We then use our IP model to explore α,βWQB's in a real world application, a recent data set of functional groups formed in genetic interactions. The filtered data set, compared to the raw data, served to investigate the role of nonexisting edges in the input bipartite graph. While mathematically equivalent in the modeling step, a nonedge in a experimentally generated network represents either a true noninteraction or a falsenegative. Assuming the input consists of meaningful features, our preliminary results show that α,βWQB's may recover missing edges with potentially higher weights better than δquasibicliques.
Simulations
As part of simulation studies, we try to retrieve a known maximum weighted quasi biclique from a weighted bipartite graph using both versions of α,βWQB's. In each simulation experiment we do the following. The pair (U, V) represents the vertices of a weighted bipartite graph G. We randomly choose U' ⊆ U and V' ⊆ V as vertices of the known quasi biclique in G. The sizes of both U' and V' are set the same and is picked randomly, but is limited to a specific percentage of the total vertices on each side. Random edges between the vertices of U' and V' in G are introduced according to a predetermined edge density d. The edges between vertices of U\U' and V\V' of G are also generated randomly according to a predetermined density d'. The edge weights of the known quasibiclique (U', V') are determined by a Gaussian distribution with a mean mn and standard deviation dev. Weights of the edges of G not present in the quasibiclique are also determined by a Gaussian distribution with a lower mean mn' and standard deviation dev'. We now retrieve maximum weighted α,βWQB_{P }and α,βWQB_{C }from G by using specific values α and β calculated as described below.
For retrieving α,βWQB_{P }, the values α and β are chosen in two different ways. As part of the simulation we evaluate the performance of both methods. The first method sets both α and β to the mean of the weights of the edges of the quasi biclique. In the second method, α and β are calculated as given below:
Similarly, for retrieving α,βWQB_{C}, the values α and β are calculated as given below.
The ILP models of the corresponding α,βWQB problems are generated in Python, and solved in Gurobi 4 [16] on a PC with an Intel Core2 Quad 2.4 GHz CPU with 8 GB memory.
For the evaluation, let (U", V") represent the maximum weighted α,βWQB returned by the ILP model. The percentage of the vertices of U' in U" is called the recall of U'. Similarly, the percentage of vertices of V' in V" is called the recall of V'. The recalls of U' and V' are our evaluation criteria. For a specific graph sizes experiments were run by varying the values mn and mn'. The values dev and dev' were set 0.1. The densities d and d' are set to 0.8 and 0.2. The experiments were run for graphs of size 16 × 16, 32 × 32 and 40 × 40. Each experiment is repeated thrice and the average number of recalled vertices is calculated. The recall of the experiments can be seen in Table 1. For each graph size the first two columns represent the recall values for percentage version α,βWQB's and the third column represents the recall value for constant version α,βWQB. As the difference between the means increases, so does the average recall. The second method of choosing α and β for percentage version α,βWQB yields a consistently higher recall.
Table 1. Simulation results of α, βWQB recall
Genetic interaction networks
A comprehensive set of genetic interaction and functional annotation published recently by Costanzo et al. [9] is amongst the best single data sources for weighted biological networks. The aim of our application is to identify the maximum weighted quasibicliques consisting of genes in different functional classes in the Costanzo dataset.
Pairwise comparisons of the total 18 functional classes provide 153 sets. For every distinct pair (A, B) of such classes, we build a weighted bipartite graph (U_{A }+ V_{B}, E, ω) where genes from functional class A are represented as vertices in U_{A }and genes from functional class B are represented as vertices in V_{B}.
The absolute values of the interaction score ε, are used as the edge weights. Values greater than 1 are rounded off to 1. Any gene present in both the functional classes A and B is represented as different vertices in the partitions U_{A }and V_{B }and the edge between those vertices is given a weight of 1. We build LP models of both α,βWQB versions for the bipartite graphs to identify the maximum weighted quasibicliques.
Biological interpretation and examples
Genes with high degree and strong links dominate the results. In several instances, the quasibicliques are trivial in the sense that only one gene is present in U', and it is linked to more than 20 genes in V'. Such quasibicliques are maximal by definition but provide limited insight. A minimum of m = 2 genes per subset was included as an additional constraint to the LP model. It might be sensible to implement such restrictions in the application in general.
We observed the following with the maximum weighted α,βWQB_{P}'s in the data sets. Given the low overall weight, the α,βWQB_{P}'s generated with the parameters α = β = 0.1 were the most revealing. Though we found many interesting quasibicliques in the 153 bipartite graphs, we only present a couple of them here. A notable latent set that was obtained identified genes involved in amino acid biosynthesis (SER2, THR4, HOM6, URE2) and was found to form a 4 × 10 maximum weighted quasibiclique with genes coding for proteins of the translation machinery, elongation factors in particular (ELP2, ELP3, ELP4, ELP6, STP1, YPL102C, DEG1, RPL35A, IKI3, RPP1A). These connections, to our knowledge, are not described and one might speculate that this is a way how translation is coupled to the aminoacid biosynthesis. In some cases the maximum weighted quasibiclique is centered around the genes that are annotated in more than one functional class as they provide strong weights. These genes are involved in mitochondrial to nucleussignaling and are examples where our approach recovers known facts. Using the query approach, it is possible to obtain quasibicliques around a gene set of interest quickly and extend the approach proteins of interest.
Maximum weighted α,βWQB_{C}'s generated from the data sets with parameters α = β = 5 reveal the following. Genetic interaction networks allow to study proteincoding genes as well as genes that might only code for RNAs. A noteworthy example was discovered in the comparison of genes involved in nuclear transport and those with an unknown bioprocess revealed proteins that are part of the nuclear pore transport (POM34, NUP60, NUP157, THP2 and POM152). They interact with a number of genes that are lined up on chromosome 15 (YML033W, YML034C (SRC1) and YML035CA) as well as and YDR431W. Most of these genes they interact with are annotated as "dubious" in the current version of the Yeast Genome Database SGD [17]. SRC1 overlaps with another uncharacterized gene YML034CA. It would be possible that locus codes of a long RNA are involved in nuclear transport.
Recovering missing edges
The published data sets have edges under different thresholds removed. To sample such missing edges, we calculate the average weight of all the edges removed in the 153 bipartite graphs (generated above), and the calculated average weight is 0.0522.
For each of the 153 maximum weighted quasibicliques of either version, the missing edges induced by the quasibicliques are then identified, and the average missing edge weight e of each is calculated. The average missing edge weight e is always greater than 0.0522. In other words, we observe that a missing edge in a maximum weighted quasibiclique has a higher expected weight than the weight of a randomly selected missing edge. This happens when the α and β values are chosen to derive α,βWQB's which are more dense in terms of weight.
We further compare e from our approach to e from the δquasibicliques (δQB) described by Liu et al. [3]. All quasibicliques (including exact δQB using our IP formulation) used to induce average missing edge weight e are:
(1) D05/M1: δQB with δ = 0.5 and minimum node size is 1, i.e., m = 1.
(2) D05/M2: δQB with δ = 0.5; m = 2.
(3) AB/M2: α,βWQB_{P }using the minimum average edge weights found from D05/M2 as α and β; m = 2.
(4) AX/M2: α,βWQB_{P }where X = α = β ∈ {0.05, 0.1, 0.2, 0.3, 0.4, 0.5}; m = 2.
(5) CX/M2: α,βWQB_{C }where X = α = β ∈ {1, 2, 3, 4, 5}; m = 2.
Comparing the averages of e from A005/M2 to A05/M2 (please see Table 2), we see a steady increase. Since α and β can be seen as expected edge weights of the resulting QB, the changing in e shows that QB's identify subgraphs of expected edge weights. However, we do not see a similar pattern in the constant versions. Overall, in this particular experiment data set, the removed edge weights are at most 0.16, hence e can never approach closely to the parameter α no matter how lenient the parameters are.
Table 2. Missing edge recovery in a genetic interaction network
Method
Time complexity
Here we prove the NPhardness of the α,βWQB_{P(C) }problem by a reduction from the maximum edge biclique problem. Note that the query_{P(c) }problem is a generalization of α,βWQB_{P(C) }problem and hence, is also NPhard.
Lemma 1. The α,βWQB_{P(C) }problem is NPhard.
Proof. Given a bipartite graph G := (U + V, E) and an integer k, the maximum edge biclique problem asks if G contains a biclique with atleast k edges. The maximum edge biclique problem is NPcomplete [10]. Let G' := (U + V, E', ω ) be a weighted bipartite graph where ω(u, v) is set to 1 if (u, v) ∈ E or is set to 0 otherwise. Note that, there is a biclique with k edges in G if and only if the maximum weighted α,βWQB_{P }in G' has a weight of atleast k when α and β are set to 1. Similarly, there is a biclique with k edges in G if and only if the maximum weighted α,βWQB_{C }in G' has a weight of atleast k when α and β are set to 0. Therefore, the α,βWQB_{P(C) }problem is NPhard.
We now prove that checking for the existence of a percentage α,βWQB in a bipartite graph is NPcomplete. Note that, checking the existence of a constant version α,βWQB in a bipartite graph can be done in polynomial time. For rest of the section we only refer to percentage version α,βWQB's.
Problem 3 (Existence).
Instance: A weighted bipartite graph G := (U + V, E, ω), values α, β ∈ [0, 1].
Find: If there exists a α,βWQB_{P }(U', V') in G.
To prove the hardness of existence problem we need some auxiliary definitions. A modified weighted bipartite graph, denoted by (U + V, E, Ω), is a complete bipartite graph (U + V, E) with a weight function Ω: E → [0, 1] where, for any two edges e and e', Ω(e)  Ω(e') ≤ 1.
Definition 4 (Modified α,βWQB (MOWQB)).
Let G := (U + V, E, Ω) be a modified weighted bipartite graph. A nonempty pair (U', V') included in (U, V) is a MOWQB of G, if it satisfies the three properties: (1) (U', V') includes (∅, V), (2) ∀u ∈ U' : ∑_{v∈V'}, w(u, v) ≥ 0, and (3) ∀v ∈ V' : ∑_{u∈U'}, w(u, v) ≥ 0.
Problem 4 (One sided existence).
Instance: A weighted bipartite graph G := (U + V, E, ω), values α, β ∈ [0, 1].
Find: If there exists a α,βWQB_{P }(U', V') in G which includes the pair (∅, V).
Problem 5 (Modified existence).
Instance: A modified weighted bipartite graph G := (U + V, E, Ω).
Find: If there exists a MOWQB in G.
The series of reductions to prove the hardness of the existence problem are as follows. We first reduce the partition problem, which is NPcomplete [18], to the modified existence problem. The modified existence problem is then reduced to the one sided existence problem. The one sided existence problem reduces to the existence problem.
Lemma 2. The modified existence problem is NPcomplete.
Proof. The proof of MOWQB ∈ NP can be briefly described in the following.
Given a weighted bipartite graph G(U + V, E, Ω) and a pair (U', V') included in (U, V), it can be verified in polynomial time if the pair (U', V') satisfies the all the MOWQB constraints for G. So, the modified existence problem belongs to class NP. The reduction from partition problem is as follows.
We are left to show that partition ≤_{p }MOWQB. Given a finite set A, and a size s(a) ∈ Z^{+ }associated with every element a of A, the partition problem asks if A can be partitioned into two sets (A_{1}, A_{2}) such that .
a. Construction: Let SUM be the sum of sizes of all elements in A. Build a modified weighted bipartite graph G := (U + V, E, Ω) as follows. For every element a in A there is a corresponding vertex u_{a }in U. The set V contains two vertices v_{+ }and v_{}. For every vertex u_{a }∈ U, Ω(u_{a}, v_{+}) = s(a)/(2 × SUM) and Ω(u_{a}, v_{}) = s(a)/(2 × SUM). Add an additional vertex u_{sum }to set U. Set Ω(u_{sum}, v_{+}) to 1/4 and Ω(u_{sum}, v_{}) to 1/4. Note that, the weights assigned to edges of G satisfy the constraint on Ω for a modified weighted bipartite graph.
b. ⇒: Let (A_{1}, A_{2}) be a partition of A such that the sum of the sizes of elements in A_{1 }is equal to the sum of the sizes of elements in A_{2}. Let U_{1 }= {u_{a }: a ∈ A_{1}}. The sum of weights of all edges from v_{+ }to the vertices in U_{1 }is equal to 1/4. Let U' = U_{1 }∪ u_{sum}. The sum of weights of all edges from v_{+ }to vertices of U' is 0. Similarly, the sum of weights of all edges from v_{ }to vertices of U' is 0. Thus, (U', V) is a MOWQB of G.
⇐: Let (P, V) be a MOWQB of G. The edge from v_{ }to u_{sum }is the only positive weighted edge from vertex v_{}. So, P will contain vertex u_{∑}. Since Ω(v_{+}, u_{sum}) is negative, set P will also contain vertices from U  u_{sum}. The sum of the weights of edges from v_{ }to vertices in P  u_{sum }cannot be smaller than 1/4. Similarly, the sum of the weights of edges from v_{+ }to vertices in P  u_{sum }cannot be smaller than 1/4. So, the sum of all elements in A corresponding to the vertices in P  u_{sum }should be equal to SUM/2. This proves that if G contains a MOWQB, set A can be partitioned.
Hence, the modified existence problem is NPcomplete.
Lemma 3. The one sided existence problem is NPcomplete.
Proof. The proof of one sided existence ∈ NP is omitted for brevity. Next we show MOWQB ≤_{p }one sided existence. We prove this problem to be NPcomplete by a reduction from the modified existence problem. The reduction is as follows.
a. Construction: Let G := (U + V, E, Ω) be the modified weighted bipartite graph in an instance of the modified existence problem. We build a graph G' := (U + V, E, ω) for an instance of one sided existence problem from G. Notice that the partition and vertices remain the same. If the weight of every edge in the G is non negative, set α = β = 0 and ω(u, v) = Ω(u, v) for every edge (u, v) ∈ E. Otherwise, set α and β to x and ω(u, v) = Ω(u, v)  x for every edge (u, v) ∈ E, where x is the minimum edge weight in G.
b. ⇒ and ⇐: Let (U', V) be a MOWQB of graph G. If weights of all edges in G are non negative, the constraints for both the problems are the same. If G has negative weighted edges, the constraints of both the problems will be the same when α,β and ω for the one sided existence problem instance are set as mentioned in the construction. It can be seen that there is a MOWQB in G if and only if there is a α,βWQB_{P }in the graph G' which includes the pair (∅, B).
This proves that the one sided existence problem is NPcomplete.
Lemma 4. Existence problem is NPcomplete.
Proof. Given a set of vertices (U', V'), a weighted bipartite graph G = (U + V, E, ω) and values α, β ∈ [0, 1], it can be verified in polynomial time if (U', V') is a α,βWQB_{P }in G. Thus, the existence problem belongs to NP. We now show that One sided existence ≤_{p }existence.
a. Construction: Let G' = (U + V, E', ω), α', β' ∈ [0, 1] be the parameters of the one sided existence problem. We build the weighted bipartite graph G = (U_{p }+ V, E, ω) for the instance of existence problem as follows. First, set G = G'. For every vertex u ∈ U, let S_{u }denote the sum of the weights of all edges incident on u. Delete every vertex u ∈ U whose S_{u }is less than αV. Let (U_{p }+ V) denote the remaining vertices, and E' represent the remaining edges in G. For the instance of the existence problem, set α = 0 and β = β'.
b. ⇒ and ⇐: Any α,βWQB_{P }in G' which includes (∅, V), is also α,βWQB_{P }in G. Consider a α,βWQB_{P }(U', V') in G. If V' = V, then (U', V') is a α,βWQB_{P }in G' which includes the pair (∅, V). If V' is not the same set as V, the pair (U', V) is still a α,βWQB_{P }in G' and it includes the pair (∅, V).
IP formulations for the α,βWQB problem
Although greedy approaches are often used in problems of a similar structure, e.g., multidimensional knapsack [19], δQB [3], in our experiments, both greedy and randomized approach did not identify solutions close enough to the exact solutions. In our experiments, simple greedy and randomized solutions yielded accuracies ranging from 60% to 95% depending on various parameters without performance guarantee. Hence we consider that it is rather important here to find exact solutions in order to demonstrate the usefulness of α,βWQB's. Here we present integer programming (IP) formulations solving the α,βWQB problem in exact solutions.
Due to the similarity in formulating constraints between α,βWQB_{C }and α,βWQB_{P}, we start by formulating a solution to α,βWQB_{P }. Our initial IP requires quadratic constraints, which are then replaced by linear constraints such that it can be solved by various optimization software packages. Our final formulation is further improved by adopting the implication rule to simplify variables involved. This improved formulation requires variables and constraints linear to the number of input edges, and thus, suits better for sparse graphs. Throughout the section, unless stated otherwise, G := (U + V, E, ω) represents a weighted bipartite graph, and G' = (U', V') represents the maximum weighted α,βWQB of G and E' represents the edges induced by G' in G.
Quadratic programming
For each u ∈ U (v ∈ V), a binary variable x_{u }(x_{v}) is introduced. The variable x_{u }(x_{v}) is 1 if and only if vertex u (v) is in U' (V'). The integer program to find the solution G' can be formulated as follows.
The quadratic terms in the constraints are necessary because, α and β thresholds apply only to vertices in U' and V'. This formulation uses variables and constraints linear to the size of input vertices, i.e., O(U + V). Since solving a quadratic program usually requires a proprietary solver, we reformulate the program so that all expressions are linear.
Converted linear programming
A standard approach to convert a quadratic program to a linear one is introducing auxiliary variables to replace the quadratic terms. Here we introduce a binary variable y_{uv }for every edge (u, v) in G, such that, y_{uv }= 1 if and only if x_{u }= x_{v }= 1, i.e., the edge (u, v) is in G'. The linear program to find the solution G' is formulated as follows.
Expressions (7) and (8) state the condition that y_{uv }= 1 if and only of x_{u }= x_{v }= 1. Expression (8) ensures that, for any edge whose end points (u, v) are chosen to be in G', y_{uv }is set to 1. Due to the use of y_{uv }variables, this formulation requires O(UV) variables and constraints.
Improved linear programming
Observe that constraint (7) becomes trivial if y_{uv }= 0. In other words, this constraint formulates implications, e.g., for binary variables p and q, the expression p ≤ q is equivalent to p → q. Expanding on this idea, we eliminate the requirement of variables y_{uv }in constraints (9) and (10) in the next formulation while sharing the rest of the aforementioned linear program.
There is a variable x_{v }for every vertex v in G. There is a variable y_{uv }for every edge (u, v) in G whose weight is not 0. The variable y_{uv }is set to 1 if and only if both x_{u }and x_{v }are set to 1. For any vertex u ∈ U (v ∈ V), the variable x_{u }(x_{v}) is set to 1 if and only if vertex u (v) is in G'. Constraint (12) can also be explained as follows. If x_{u }= 1, the constraint transforms to the second constraint in the α,βWQB Definition. If x_{u }= 0, constraint (12) becomes trivial. Constraint (13) can be explained in a similar manner.
Generalized formulation for α,βWQB_{P }and α,βWQB_{C}
Recall that the difference between the two problems α,βWQB_{P }and α,βWQB_{C }is in the edge weight summation which we can combine as the following properties: (1) ∀u ∈ U' : ∑_{v∈V' }ω(u, v) ≥ α_{P }V'  α_{C}, and (2) ∀v ∈ V' : ∑_{u∈U' }ω(u, v) ≥ β_{P }U'  β_{C}, where (α_{P}, β_{P }) and (α_{C}, β_{C }) are the parameters given in α,βWQB_{P }and α,βWQB_{C }respectively. Following the same reasoning in the previous paragraphs, linear constraints (12) and (13) are now updated as the following.
As a results, the problem instance is a α,βWQB_{C }problem if (α_{P}, β_{P }) = (1, 1), and it is a α,βWQB_{P }problem if (α_{C}, α_{C}) = (0, 0). Note that the formulation does not require either condition to present; it essentially defines a generalization of α,βWQB problems when all 4 parameters are valid and nonzero.
If there are n vertices in U and m vertices in V, there will be a total of m + n + 2k constraints and m + n + k variables where k is the number of edges whose weight is not equal to 0. The above formulations can be extended to solve the query problem by adding an additional constraint x_{v }= 1 to the formulation, for every vertex v ∈ P ∪ Q. Similar constraints also help us explore sub optimal solutions, e.g., excluding known vertices in subsequent solutions, or provide a lowerbound of required query items in the optimal solution.
Conclusions
We address noise and incompleteness in biological networks by introducing graphtheoretical optimization problems that identify variations of novel weighted quasibicliques. These quasibiclique problems incorporate biological interaction levels in different analytical settings and exhibit improvements over unweighted quasibicliques. To meet demands of biologists we also provide a query version of (weighted) quasibiclique problems. We prove that our problems are NPhard, and describe IP formulations that can tackle moderate sized problem instances. Simulations and empirical data solved by our IP formulation suggest that our weighted quasibiclique problems are applicable to various other biological networks.
Future work will concentrate on the design of algorithms for solving largescale instances of weighted quasibiclique problems within guaranteed bounds. Greedy approaches may result in effective heuristics that can analyze evergrowing biological networks. A practical extension to the query problem is the development of an efficient enumeration of all maximal α,βWQB's.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
WCC and SV were both responsible for developing the solution, carrying out experiments, and writing of the manuscript. RK performed the experimental evaluation, analysis, and contributed to the writing of the manuscript. OE supervised the project and contributed to the writing of the manuscript. All authors read and approved the final manuscript.
Acknowledgements
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 10, 2012: "Selected articles from the 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11)". The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S10.
We thank Heiko Schmidt for discussions that initiated the concept of weighted quasibicliques. Further, we thank Nick Pappas, John Wiedenhoeft and anonymous reviewers for valuable comments. WCC, SV, and OE were supported in part by NSF awards #0830012 and #1017189.
References

Waksman G: Proteomics and ProteinProtein Interactions Biology, Chemistry, Bioinformatics, and Drug Design. Springer Verlag; 2005.

Goh K, Cusick M, Valle D, Childs B, Vidal M, Barabási A: The human disease network.
PNAS 2007, 104(21):8685. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Liu X, Li J, Wang L: Modeling protein interacting groups by quasibicliques: complexity, algorithm, and application.

Sim K, Li J, Gopalkrishnan V: Mining maximal quasibicliques: novel algorithm and applications in the stock market and protein networks.
Analysis and Data Mining 2009, 2(4):255273. Publisher Full Text

Liu H, Liu J, Wang L: Searching maximum quasibicliques from proteinprotein interaction network.

Ding C, Zhang Y, Li T, Holbrook S: Biclustering protein complex interactions with a biclique finding algorithm.

Li H, Li J, Wong L: Discovering motif pairs at interaction sites from protein sequences on a proteomewide scale.
Bioinformatics 2006, 22(8):989996. PubMed Abstract  Publisher Full Text

Dietrich B: Some of my favorite integer programming applications at IBM.
Annals of Operations Research 2007, 149:7580. Publisher Full Text

Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear E, Sevier C, Ding H, Koh J, Toufighi K, Mostafavi S, et al.: The genetic landscape of a cell.
Science 2010, 327(5964):425431. PubMed Abstract  Publisher Full Text

Peeters R: The Maximum edge biclique problem is NPcomplete.
Discrete Appl Math 2003, 131(3):651654. Publisher Full Text

Alexe G, Alexe S, Crama Y, Foldes S, Hammer PL, Simeone B: Consensus algorithms for the generation of all maximal bicliques.
Discrete Appl Math 2004, 145:1121. Publisher Full Text

Geva G, Sharan R: Identification of protein complexes from coimmunoprecipitation data.
Bioinformatics 2011, 27:111117. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yan C, Burleigh JG, Eulenstein O: Identifying optimal incomplete phylogenetic data sets from sequence databases.
Mol Phylogenet Evol 2005, 35(3):528535. PubMed Abstract  Publisher Full Text

Wang L: Near optimal solutions for maximum quasibicliques. In COCOON 2010. Volume 6196. LNCS; 2010::409418.

Chang WC, Vakati S, Krause R, Eulenstein O: Mining biological interaction networks using weighted quasibicliques. In ISBRA 2011. Volume 6674. LNCS; 2011::428439.

Gurobi Optimization Inc: Gurobi Optimizer 4.5.
2011.

Engel S, Balakrishnan R, Binkley G, Christie K, Costanzo M, Dwight S, Fisk D, Hirschman J, Hitz B, Hong E, et al.: Saccharomyces genome database provides mutant phenotype data.
Nucleic acids research 2010, 38(suppl 1):D433D436. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Garey MR, Johnson DS: Computers and Intractability: A Guide to the Theory of NPCompleteness. New York: W. H. Freeman; 1979.

Kellerer H, Pferschy U, Pisinger D: Knapsack Problems. Springer; 2004.