Abstract
Background
Analyzing interaction networks for functional characterization poses significant challenges arising from the noisy, incomplete, and generic nature of both the interaction data as well as functional annotation of molecules. Networkbased methods focus on interacting molecules (pairs or sets) occurring in close proximity to infer functional associations.
Results
In this paper we perform a formal comparative investigation of the relationship between functional coherence and topological proximity in networks. We investigate the problem of assessing the coherence of sets of biomolecules (or segments thereof) taking into account functional specificity as well as the distribution of functional attributes across entity groups. We also propose novel measures of topological proximity that are more robust to noisy and incomplete interaction data.
Conclusion
We derive the following results in this paper: (i) there exists strong correlation between functional similarity and topological proximity in various network abstractions, with domain interaction networks (DDIs) demonstrating higher correlation than protein interaction networks (PPIs); (ii) measures that quantify coherence among entire sets of proteins are superior to aggregates of known pairwise measures; and (iii) randomwalk based measures of topological proximity are better suited to existing interaction data. We validate our methods on diverse data, including experimentally and computationally derived PPIs and DDIs, as well as on sets of known biologically related groups of molecules.
Background
Analysis of interaction data generated from high throughput experiments takes a networkcentric view of functions of biological systems and the role of the underlying components. Recent advances in this area have focused on the development of computational tools for networkbased functional annotation [1], identification of functionally coherent modules [2], and relationship between network structure and function [3,4], among others. Network proximity and connectivity are also shown to be effective in identifying proteins that are implicated in similar phenotypes [5].
In this paper, we comprehensively investigate the relationship between topological and functional modularity in the context of two network abstractions  proteinprotein interaction (PPI) and domaindomain interaction (DDI) networks. Key to understanding the relationship between network topology and functional modularity are: (i) suitable measures for assessing the functional coherence (or similarity) of a group of entities with respect to each other, and (ii) measures for quantifying the topological proximity in a network with potential missing interactions and noise. To assess functional coherence, canonical libraries of molecular function, such as Gene Ontology [6], are typically used [7]. Since annotations for different types of molecular entities (e.g., proteins or domains) are derived in different ways [8], they have different implications with respect to their specificity and frequency distributions. Consequently, an important challenge in assessment of functional coherence is the development of measures that are robust to variations in distribution as well as missing data. In recent work, we have shown that informationtheoretic measures that are specifically designed to address these challenges are effective in capturing the relationship between the functional coherence and network proximity of pairs of proteins [9]. In this paper, we build upon existing methods for quantifying functional coherence and topological proximity through the following key results:
• We propose novel measures for assessing the functional coherence of a group of molecules (in contrast to pairs of molecules).
• We propose the use of information flow based modeling of topological proximity and connectivity in a network of interactions (in contrast to traditionally used interaction counts or shortest paths).
We elaborate on these contributions below.
Functional coherence of a group of molecules
Traditional measures of functional coherence, including our own prior results [9], have largely focused on pairwise distance measures. Generalizing from pairwise measures to coherence measures for sets of molecules adds significant complexity. For example, in testing the hypothesis that functional modularity is related to connectivity in PPI networks, it is common to investigate the functional purity of groups of proteins that induce dense subgraphs in the network [10]. While these enrichmentbased methods have been widely used, they provide common overrepresented GO terms in a given set. They do not, however, provide a measure for the homogeneity of underlying modules (sets). We show that simple extensions of pairwise measures to group measures by averaging, taking the min, max, or other such associative operations result in suboptimal setcoherence measures. We propose novel measures of homogeneity of entire protein sets and demonstrate their superiority over generalized pairwise measures on known groups of homogeneous complexes as compared to a control of randomly generated protein sets.
Information flow based topological proximity
Topological information is used to identify functionally related proteins using shortest paths or density of direct interactions [1,5]. However, evidence suggests that multiple alternate paths between functionally associated proteins are often conserved through evolution, owing to their contribution to robustness against perturbations, as well as amplification of signals [11]. Consequently, consideration of multiple paths between molecules in a network of interactions is likely to be more effective in capturing the functional association between these molecules. Furthermore, consideration of alternate paths may account for missing data and noise in PPI networks [12]. There exist many methods for the assessment of network proximity based on the multiplicity of paths between nodes, including effective resistance [13], commute distance [14], and random walk proximity [15]. In this paper, we adapt an abstraction that models information flow in the cell using random walks with restarts [16].
Methods
Several methods have been proposed for assessing functional similarity of biological entities (genes, proteins, domains) [1719]. Since the functional categories in which these entities are categorized are themselves interrelated through a taxonomy (e.g., Gene Ontology), measures for similarity must consider the underlying taxonomy while comparing molecules in terms of their functional annotation [20]. Various approaches take into account different factors, including taxonomic distance, specificity/generality (rank in hierarchy) of common ancestors, and associated number of molecules for the functional terms being compared (statistical significance or information content). Since most molecules are associated with multiple functional terms, assessment of functional similarity between two molecules poses the additional challenge of evaluating the similarity between two sets of terms, as opposed to a pair of terms. In [9], we developed an information theoretic measure for computing similarity of two sets of terms associated with a pair of molecules. We show that our measure is superior to other composite measures computed by applying associative operators (average, max, etc.) to pairwise term similarity measures.
In this paper, we generalize and extend our results to quantify the functional coherence (or similarity) of a set of biomolecules (as opposed to a pair). Since each molecule corresponds to a set of annotations, the problem is one of quantifying the coherence of a set of sets of terms. A straightforward approach to this would compute pairwise similarities of each pair of molecules in the set and to aggregate them using associative operators (min, max, average). Pairwise similarities (similarity of two sets of annotations) may themselves be computed using our information theoretic measure. An alternate approach to the problem, proposed in this paper, computes the coherence of the set of molecules without computing intermediate pairwise similarity scores. We show that the latter approach is strictly superior to the former in quantifying the coherence of a set of biomolecules. We validate this claim by applying our proposed measure, along with several other currently used measures to a test group of known functionally related proteins. We also apply the measures to randomly generated groups and identify measures that induce the greatest separation between the test and random groups.
Finally, in order to study the correlation between functional coherence and topological proximity in networks, we also need a measure for topological proximity. Traditional measures of topological proximity rely on the shortest path between two nodes. While this measure is more suited to wellcurated and complete datasets, it is susceptible to missing interactions and noise. A single false positive or negative may lead to significant (erroneous) perturbation in shortestpath based measures. Measures based on random walks with restart [16], on the other hand, are more resilient to incomplete and noisy data. We consider both classes of measures of topological proximity, and evaluate their correlation with various functional similarity measures for both protein interaction (PPI) and domain interaction (DDI) networks. We show that a combination of randomwalk based topological proximity and our similarity measure([9]) yield the strongest correlation between network proximity and functional coherence.
Concepts and ontologies
Let C = {c_{i}1 ≤ i ≤ N } be a finite partially ordered set of concepts. In terms of Gene Ontology (GO), these concepts represent the GO terms in the subontologies (i.e., molecular function, biological process, and cellular component). Without loss of generality, we refer to concepts as terms throughout this paper. Terms are related to each other through is a and part of relationships, such that c_{i }→ c_{j }denotes c_{i }is a/part of c_{j}. Note that, if c_{i }→ c_{j}, then the molecules associated with c_{i }are also associated with c_{j}, known as the true path rule. Based on these relationships, we define a binary relation over C, denoted by ≼. We say c_{j }is an ancestor of c_{i}, denoted by c_{i }≼ c_{j }if and only if either c_{i }→ c_{j}, or for some ℓ ≥ 1, there exist for 1 ≤ ℓ ≤ 1 such that for 1 ≤ ℓ < l, and (c_{j }is an ancestor of c_{i }in GO hierarchy). Two terms c_{i}, c_{j }are comparable, denoted by c_{i }~ c_{j}, if either c_{j }≼ c_{i }or c_{i }≼ c_{j}. If c_{i }and c_{j }are comparable, then the shortest path between c_{i }and c_{j }is given by L(c_{i}, c_{j}) = L(c_{j}, c_{i}) = ℓ + 1 for minimum such ℓ.
We denote the set of ancestors of a term c_{i }by A_{i }= {c_{k }∈ Cc_{i }≼ c_{k}}. Note that, not all ancestors of a term are comparable, since the GO hierarchy is a directed acyclic graph, as opposed to a tree. We represent the root term of GO with a terminal concept r, such that c_{i }≼ r ∀c_{i }∈ C.
Semantic similarity of terms
Semantic similarity measures quantify the similarity between two terms based on the underlying taxonomical relationships. The information content based measure of semantic similarity quantifies similarity between a pair of terms by taking into account the distribution of terms among molecules. Specifically, it rewards infrequent similar terms, over those that are frequent. Let G_{c }be the set of molecules associated with term c in the available database, with G_{r }being the set of all molecules. The information content of a term is defined as I(c) =  log2(G_{c}/G_{r}) [20]. Clearly, I(r) = 0, and as a consequence of the true path rule, I(c_{j}) ≥ I(c_{i}) for c_{j }≼ c_{i}. Then, the semantic similarity between two terms is defined as
Here, is said to be the minimum common ancestor of c_{i }and c_{j}.
Observe that this measure does not take into account the specificity of terms with identical common ancestors. This problem can be alleviated by normalizing the similarity between two terms by the selfsimilarities of the terms being compared, e.g., by [21]. Note, this measure has a well defined maximum of 1 and offer bounded interpretation (ranging from 0 to 1) of Resnik's metric. We now generalize these termsimilarity measures to setsimilarity.
Functional similarity of molecules
Biomolecules are generally associated with multiple molecular functions and often involved in multiple processes. Consequently annotations of molecules correspond to sets of terms, as opposed to individual terms. While assessing the similarity of sets of terms, we assume that the sets are nonredundant, i.e., each set consists of terms that are not comparable. This can be easily enforced by ensuring that each branch in the hierarchy is represented by at most one term in each set. In GO, this involves considering only the most specific annotations associated with a gene, which provides a nonredundant representation of functional annotation. In this representation, the association between the gene and the ancestors of the most specific term is implied by the true path rule.
An important challenge in the assessment of the functional coherence of sets is that these sets are often incomplete (that is, for many molecules, some of their functions are unknown). Therefore, a reliable measure is one that rewards the abundance of similar terms in the terms, but does not penalize existence of unrelated terms in one of the sets, since the relation between these terms and the other set may be currently unknown. Simple associative measures that aggregate the similarity of pairs of terms in the two terms, such as average (ρ_{A}) [17], maximum (ρ_{M}) [22], or average of maximums (ρ_{H}) [18] do not satisfy these properties [9].
Motivated by these considerations, in prior work, we extend the notion of minimum common ancestors to sets of terms, and generalize the concept of information content from a single term to a set of terms [9]. Let be the minimum common ancestor set of term sets S_{i }and S_{j}, and ⊔ denote a generalized union operator that preserves nonredundancy by keeping the most specific terms. The similarity between two term sets is defined as the information content of the set of minimum common ancestors, i.e.,
where is the set of molecules that are associated with all terms in the set Λ(S_{i}, S_{j}). Note that ρ_{I }also needs to be normalized with respect to self similarities, i.e., ρ_{JC }= 1/(ρ_{I}(S_{i}, S_{i}) + ρ_{I}(S_{j}, S_{j})  2ρ_{I}(S_{i}, S_{j}) + 1)
Functional coherence of modules
Let ℛ be a set of n molecular entities (genes, proteins, domains), with each entity being associated with a set of terms, i.e., ℛ = {S_{1}, S_{2}, ..., S_{n}}. We aim to develop a measure σ(ℛ) to assess the functional coherence of this set, such that a larger σ indicates more semantic similarity between the terms in sets S_{1}, S_{2}, ..., and S_{n}. Without loss of generality, we call ℛ a module, since the objective here can also be considered as assessing the modularity of ℛ. We consider various measures to assess the functional coherence of a module, which are discussed below. In order to illustrate each measure, we use a running example based on the ontology shown in Figure 1. In the figure, let ℛ_{1 }= {S_{1}, S_{2}, S_{3}, S_{4}} be a module that can be interpreted as a complex composed of two subcomplexes ℛ_{2 }= {S_{1}, S_{2}, S_{3}} (with the shared term c_{4}) and ℛ_{3 }= {S_{3}, S_{4}} (with the shared term c_{6}), in which S_{3 }"bridges" the two subcomplexes ℛ_{2 }and ℛ_{3}.
Figure 1. Sample ontology. S_{1 }= {c_{4}}, S_{2 }= {c_{4}}, S_{3 }= {c_{4}, c_{6}}, S_{4 }= {c_{1}, c_{6}}, S_{5 }= {c_{1}}, S_{6 }= {c_{6}}. Sample ontology and annotations. Each node of the hierarchy represents a term, each set represents a protein.
Average of pairwise information content
A straightforward way of computing set coherence is to compute the average of the pairwise n(n  1)/2 set similarity scores [19,23]:
In our running example, the average pairwise information content of the molecules in complex ℛ_{1 }is given by σ_{A}(ℛ_{1}) = (I(c_{4}) + I(c_{4})/2 + 0 + I(c_{4})/2 + 0 + I(c_{6})/4)/6 = 3/8, while that of subcomplex ℛ_{2 }is given by σ_{A}(ℛ_{2}) = (I(c_{4}) + I(c_{4})/2 + I(c_{4})/2)/3 = 2/3, given that I(c_{4}) = I(c_{6}) = log_{2 }(3/6) = 1. Bridged complexes get lower score than specialized complexes due to differences in subcomplex annotations.
Generalized information content
It is possible to extend the notion of the minimum common ancestor of pairs of terms to tuples of terms as . In the other words, the minimum common ancestor of a set of n terms is defined as the most specific among the terms that are common ancestors of all of n terms in the set. Then, for each ntuple c_{1 }∈ S_{1}, c_{2 }∈ S_{2}, ..., and c_{n }∈ S_{n}, the functional coherence of these terms can be quantified as . Consequently, the minimum common ancestor set of S_{1}, S_{2}, ..., S_{n}can be computed as
leading to a generalization of the information content based measure:
In our running example, since λ(c_{4}, c_{1}) = λ(c_{4}, c_{6}) = λ(c_{4}, c_{1}, c_{6}) = r, we have Λ(ℛ_{1}) = {r}, thus the generalized information content of complex ℛ_{1 }is σ_{I}(ℛ_{1}) = I(r) = 0. On the other hand, since Λ(ℛ_{2}) = {c_{4}}, we have σ_{I}(ℛ_{2}) = I(c_{4}). As illustrated by this example, σ_{I }is a rather conservative measure of functional coherence and it only rewards specialized modules in which all molecules share very similar functions.
Graph information content
We extend the graph information content measure proposed by Pesquita et al. [24]. The idea behind this approach is that, if a group of molecules are coherent, then the information content of the DAG induced by the intersection of ancestors is close to the information content of the DAG induced by the union of ancestors. In other words, defining as the ancestor set S_{i}, graph information content of set ℛ is defined as
Observe that, if all molecules are annotated with the same set of terms, σ_{G}(ℛ) would be equal to one, and zero if they have no common terms. Similar to σ_{I}, a drawback of this measure is its sensitivity to outliers; that is, if a single molecule in the set is sufficiently functionally different it has a significant impact on the score. Indeed, in our running example, we have σ_{G}(ℛ_{1}) = I(r) = 0, while σ_{G}(ℛ_{2}) = (I(c_{4}) + I(c_{2}))/(I(c_{4}) + I(c_{2}) + I(c_{6}) + I(c_{3})) = 1/2, given that I(c_{2}) = I(c_{3}) = log_{2 }(3/6) = 1.
Weighted information content
Complexes are functionally cohesive modules, but they are often composed of subcomplexes, each performing a specific part of the general function of the complex [25]. However, as illustrated by our running example, generalized information content (σ_{I}) and graph information content (σ_{G}) require all molecules to be functionally coherent with each other for the module to be considered coherent. In order to provide a more relaxed, and biologically motivated measure of functional coherence, we consider shared functionality between all combinations of molecules and weigh the information content of shared functionality by the number of molecules that contribute to the shared functionality.
Specifically, let be the set of terms in the ancestor set of S_{i }that are not shared with any other molecule in ℛ. Then, weighted information content of set ℛ is defined as the ratio of the information content of all terms that are shared in at least two molecules to the information content of all terms associated with at least one molecule in the set; that is:
In other words, we consider all the partial DAGs () generated by each S_{i }in ℛ. All the terms that are part of overlapping DAG correspond to shared information among those proteins. The numerator in the above equation corresponds to the information content of the overlapping DAG, while the denominator normalizes that score with total information of the combined DAG. In our running example, we have
σ_{W }(ℛ_{1}) = (3I(c_{4}) + 3I(c_{2}) + 2I(c_{6}) + 2I(c_{3}))/(3I(c_{4}) + 3I(c_{2}) + 2I(c_{6}) + 2I(c_{3}) + I(c_{1})) = 0.86 and σ_{W }(ℛ_{2}) = (3I(c_{4}) + 3I(c_{2}))/(3I(c_{4}) + 3I(c_{2}) + I(c_{6}) + I(c_{3})) = 3/4, given that I(c_{1}) =  log_{2 }(2/6) ≈ 1.6 Since members of the module ℛ_{1 }share all functions other than c_{1}, this measure captures the coherence of the bridged module better than other methods. This method only penalizes for functions which are not shared by a member with rest of the module.
Postprocessing coherence scores
We now discuss how coherence scores are processed to make them comparable against each other for different module sizes and across various subontologies.
Combination of subontology scores
The scores discussed above can be based on any of the three subontologies of GO. Since cellular component annotations are sparser than annotations of biological process and molecular function, we use the method proposed by Schlicker et al. [26]. For pairs of molecules, we combine the two coherence scores obtained from biological process and molecular function ontologies as:
where max ρ^{(BP) }and max ρ^{(MF) }are the maximum possible scores for biological process and molecular function, respectively. Module coherence scores (σ) are based only on biological process ontology.
Accounting for module size
In order to compare modules of different sizes, we normalize the functional coherence scores based on a background distribution that characterizes the coherence of modules of identical size. Specifically, for a given module ℛ, we generate a sufficiently large number of random modules of size ℛ and compute the functional coherence of each of these modules. Then, letting denote the average functional coherence of these modules, we compute the sizeadjusted coherence score of ℛ as .
Index of detectability
In order to compare various measures of functional coherence, we assemble a positive (test) group and a randomly selected (control) group of proteins. The positive set comprises of proteins that are known to be functionally related based on prior biological knowledge (i.e., they are known to exist in complexes and perform related functions). Clearly, if we plot coherence values for samples from the test set and from the control set, we expect to see two distinct distributions  samples from the test group are expected to have higher coherence scores than those from the control group. The separation of the two distributions induced by each method indicates the fitness of the measure in quantifying coherence in sample sets, in terms of distinguishing coherent and arbitrary sets. This separation is quantified as:
which is proportional to the area under the binormal ROC curve [27]. Here, T and C denote the sets of test and control modules, respectively.
Measure for topological proximity
The most commonly used measure of topological proximity is graph distance, where the distance between a pair of nodes in a connected graph is defined as the length of the shortest path between them. In the context of biological networks, there are several drawbacks to this measure. It is particularly susceptible to missing or incorrect data  i.e., a single missing edge may reduce proximity significantly, alternately, a single false edge may increase proximity incorrectly [28]. Furthermore, this measure does not take into account the global structure and connectedness of the graph, with alternate paths between a pair of nodes.
Nodes connected to each other via disjoint paths are likely to be functionally closer than nodes that are connected via a single path. Indeed, evidence suggests that multiple alternate paths between functionally associated proteins are often conserved through evolution, owing to their contribution to robustness against perturbations, as well as amplification of signals [11].
To alleviate these drawbacks, we consider an alternate measure that captures the multifaceted relationship between a pair of nodes [16]. This measure uses a random walk with periodic restarts to estimate the affinity between pairs of nodes. In this model, the random walk is initiated at node i, with neighbor transition probability proportional to edge weight, and at each step, the walk returns to source node i with probability c. The proximity of node j to node i is defined as the relative amount of time spent at node j by such an infinite random walk. It can be shown that the proximity of all nodes to node j can be computed iteratively as
Here, W is the stochastic matrix derived from the adjacency matrix of the network, is the restart vector with if j = i and 0 otherwise, and . Then, the proximity of node j to node i is given by . Repeating this procedure for all proteins, we obtain a matrix of network proximity scores for all pairs of proteins. Note, however, that this measure of proximity is not symmetric (proximity of j to i is not necessarily equal to the proximity of i to j). Therefore, we take the average of the two proximity values to compute the proximity between a pair of proteins. Using the proposed measures of functional coherence and the randomwalk based measure for topological proximity, we quantify the relationship between topological proximity and functional coherence by computing the correlation of the resulting matrices.
Materials
We obtain protein interaction data for S. cerevisiae and S. pombe, from the BioGRID database [29] version 2.0.51. We filter the dataset to obtain a set of physical interactions between proteins, i.e., genetic interactions are removed based on experiment systems (e.g., knockout experiments) mentioned on the BioGRID website. Integr8 [30] is used to map the proteins in the interaction dataset to their Uniprot names, keeping only those proteins that we can map to a Gene Ontology term using Integr8.
We obtain domain interaction data from the DOMINE database [31] version 1.1. This dataset is composed of known, as well as predicted domain interactions. Based on the source and quality of the data, we partition this dataset. Struct interactions are inferred from PDB entries of protein complexes and are collected from iPfam and 3did. Comp2 interactions are predicted by at least two computational methods that infer domain interactions from protein interaction networks using techniques such as maximum likelihood estimation or from coevolution of conserved sites in protein sequences. HC+MC interactions consists of high and medium confidence interactions (for details, please refer to [31]).
To test the functional coherence of sets, we obtain positive and random cases from GRIP [32]. GRIP generates positive cases from MIPS CYGD complex catalogue [33] by picking sets from known complexes. For wildtype cases, GRIP selects proteins at random. We generate a total of 16 datasets of which eight are made up of positive cases and eight are random. Each set consists 2000 sets of proteins (complexes), ranging from four to eleven proteins each.
Gene Ontology Annotation (GOA) [34] release 47.0 dated 2009/03/09 is used to obtain annotation information for Uniprot proteins. GOA combines manual and automated inferences of gene product annotations. The mapping of PfamA domains to their Gene Ontology functions is obtained from pfam2go http://www.geneontology.org/external2go/pfam2go webcite released on 2009/03/04. We use only the Biological Process and Molecular Function subontologies of Gene Ontology [6] version 1.550 for evaluation, since the coverage for the Cellular Component subontology is relatively sparse.
Results and discussion
We first compare the behavior of the molecular similarity metrics by examining their correlation with different topological proximity measures, and follow with a detailed look at their behavior on comprehensive PPI and DDI data. We then investigate the differences between PPI and DDI networks in terms of the relationship between network proximity and functional similarity using our generalized information content based metric. Finally we compare various measures for computing the functional coherence of sets. To evaluate similarity visavis proximity, we compute, for every pair of nodes in a network, the shortest distance between them, proximity for a given value of c, and various semantic similarity measures. Using these, we compute correlation of topological proximity metrics and functional similarity measures. As in [9] we normalize raw similarity scores to obtain a mean similarity score of zero and standard deviation of one. We create groups of pairs based on their topological proximity and compute average semantic similarity for each group.
Topological proximity and semantic similarity measures
We first evaluate the proximity measure based on random walks. Since the parameter c can be varied to perturb the affinity between nodes, we first estimate an optimal value for c. We compute the proximity matrix for various values of c, ranging from 0.1 to 0.9, for the domain network HC+MC. We also compute the semantic similarity scores for different metrics  average of information content (IC) based term similarity (ρ_{A}/δ_{I}), average of selfnormalized IC based term similarity (ρ_{A}/δ_{JC}), IC based molecule similarity (ρ_{I}), and selfnormalized IC based molecule similarity (ρ_{JC}). We compute the correlation between these computed functional similarity scores and topological proximity. Semantic similarity is computed for the biological process (BP) and molecular function (MF) ontology separately, as well as by combining the two scores.
As evident in Figure 2(a), for c = 0.3, we obtain the best correlation between the proximity matrix and any semantic similarity metric using combined BP and MF ontology. For further analyses, we use this value (c = 0.3) to compute the proximity matrix. In this network we also note that topological proximity (c = 0.3) has much better correlation with functional similarity than shortest path, for all similarity metrics. This validates our proposed use of randomwalk based topological proximity measure. Indeed, this behavior follows our hypothesis that since proximity takes into account all paths from one node to another, two nodes connected in multiple ways are expected to be more functionally similar.
Figure 2. Comparing topological metrics. Comparison of different network distance measures in terms of their behavior with respect to semantic similarity metrics in HC+MC domain network, using the (a) for various values of c and shortest path (b) effect of ontologies.
In Figure 2(b) we plot the correlation of topological proximity and semantic similarity measures using BP, MF and both ontologies. BP offers slightly better correlation than MF. In general, MF corresponds to a lower level property of a molecule directly related to its structure. BP is a higher level construct, related to the wider neighborhood in the network. Hence interacting molecules are more likely to belong to the same processes even if they have different functions. Finally the correlation obtained by combining the two ontologies is higher than taking them separately.
Topological proximity and functional similarity in networks
Using the measure ρ_{JC }by combining BP and MF ontology, we compare the relationship between functional similarity, random walk based network proximity (Figure 3a), and network distance (Figure 3b). We plot the normalized average semantic similarity, as in the previous case, for the PPI and DDI networks for various groupings of proximity values and shortest path distances. Each bin in Figure 3a is adjusted such that the number of pairs in each bin in Figure 3a is approximately equal to that in Figure 3b. As evident in the figure, the larger the proximity (between a pair of nodes) the (more) similar their functions. Conversely, lesser the distance between a pair of nodes, higher their similarity. Larger the slope between the groupings the better the measure performs (or dataset is) in grouping similar functioned molecules together. For both proximity measures, we find that DDIs have better functional similarity than PPIs, as also noted in [9]. Further, it is apparent that the relationship between functional similarity and topological proximity is stronger in computationally inferred DDI networks than that in PPI networks. Among the PPI networks, S. cerevisiae, which is the most completely annotated and studied, we observe stronger correlation between functional similarity and both proximity measures, compared to other PPIs.
Figure 3. Comparing PPIs and DDIs. Comparison of various networks with respect to the relation between semantic similarity and (a) network proximity and (b) network distance.
Correlation of proximity measures and similarity in Figure 2a provides comparison of the curves in Figures 3a and 3b. Further comparison of Figures 3a and 3b indicates that the slopes of the curves are are generally higher for random walk based proximity, as compared to shortest path. Again, since proximity binds two nodes not just on topological distance but also on number of paths in between, only strongly correlated nodes have higher proximity values. These observations suggest that network proximity based on random walks is likely to be more relevant to, hence indicative of, functional coherence and modularity.
Comparing measures for sets
We evaluate coherence measures for sets using index of detectability on sets with (known) functionally correlated protein and sets of randomly selected proteins. We compute functional coherence using the following measures: average of pairwise information content (using term σ_{A}/ρ_{A}/δ_{I }and molecule σ_{A}/ρ_{I }based similarity), Generalized Information Content (σ_{I}), Extended graph information content (σ_{G}), and Weighted information content (σ_{W}). As we observe in the previous section, since biological processes span wider neighborhoods, they are more likely to be shared in a module. For this reason, we compute the coherence score using only the biological process ontology. As the index of detectability is a measure of significance, we can plot a curve indicating the threshold corresponding to pvalue < 0.05.
Figure 4 shows the index of detectability for various measures as the size of modules is varied from 4 to 11. We note that average of pairwise similarity based on molecule IC performs best from small modules, and that its performance remains steady as module size increases. Extended graph information content performs the worst and its performance decreases drastically as module size increases. As the module size increases, we expect the complex to be composed of subcomplexes with specific function, while the overall functionally shared among all molecules in this complex may be general. We see similar behavior in the generalized information content measure. The weighted information content based measure demonstrates improved performance as set size increases. This is because it can detect all shared functionality among subcomplexes that are parts of the entire complex, and have overlaps or bridges among them to carry out the biological tasks. Figure 4 also displays a curve indicating the threshold on index of detectability that corresponds to a statistical significance of p < 0.05 (according to normal distribution). This curve shows that only the weighted information content and pairwise molecule based similarity metric deliver significant performance in distinguishing known complexes and random sets of genes (p < 0.05), and the performance of the proposed measure increases with increasing complex size.
Figure 4. Coherence in sets. Comparison of detectability index for various coherence measures and complex sizes.
Conclusion
We draw the following conclusions from our study: (i) our proposed measure of functional coherence of sets of entities (proteins, domains) is superior to other existing measures, (ii) we comprehensively study the relationship between functional coherence and topological proximity using suitable measures and derive formal conclusions for process and function based annotations, and (iii) we use our measures to study a range of PPIs and DDIs and establish the suitability of these abstractions to various kinds of analyses.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
JP developed the methods and performed all analyses. The three authors participated in the conception of the methods and the analysis. Together, the three authors wrote this manuscript.
Acknowledgements
This research was supported by NSF grants 0800568, 0835541, and 0641037.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 1, 2010: Selected articles from the Eighth AsiaPacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/11?issue=S1.
References

Sharan R, Ulitsky I, Shamir R: Networkbased prediction of protein function.
Mol Syst Biol 2007., 3 Publisher Full Text

Spirin V, Mirny LA: Protein complexes and functional modules in molecular networks.
PNAS 2003, 100(21):1212312128. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yook SH, Oltvai ZN, Barabási AL: Functional and topological characterization of protein interaction networks.
Proteomics 2004, 4:928942. PubMed Abstract  Publisher Full Text

Przulj N, Wigle DA, Jurisica I: Functional topology in a network of protein interactions.
Bioinformatics 2004, 20:340348. PubMed Abstract  Publisher Full Text

Lage K, Karlberg OE, Størling ZM, Páll , Pedersen AG, Rigina O, Hinsby AM, Tümer Z, Pociot F, Tommerup N, et al.: A human phenomeinteractome network of protein complexes implicated in genetic disorders. [http://dx.doi.org/10.1038/nbt1295] webcite
Nature Biotechnology 2007, 25(3):309316. PubMed Abstract  Publisher Full Text

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium.
Nature Genetics 2000, 25:2529. PubMed Abstract  Publisher Full Text

Khatri P, Drãghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems.
Bioinformatics 2005, 21:35873595. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Schug J, Diskin S, Mazzarelli J, Brunk B, Stoeckert C: Predicting gene ontology functions from ProDom and CDD protein domains.
Genome Res 2002, 12:648655. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pandey J, Koyuturk M, Subramaniam S, Grama A: Functional coherence in domain interaction networks. [http://dx.doi.org/10.1093/bioinformatics/btn296] webcite
Bioinformatics 2008, 24(16):i2834. PubMed Abstract  Publisher Full Text

Grossmann S, Bauer S, Robinson PN, Vingron M: An Improved Statistic for Detecting OverRepresented Gene Ontology Annotations in Gene Sets.
10th International Conference on Research in Computational Moecular Biology (RECOMB'06) 2006, 8598.

Kelley R, Ideker T: Systematic interpretation of genetic interactions using protein networks. [http://dx.doi.org/10.1038/nbt1096] webcite
Nature Biotechnology 2005, 23(5):561566. PubMed Abstract  Publisher Full Text

Kelley BP, Sharan R, Karp RM, Sittler T, Root DE, Stockwell BR, Ideker T: Conserved pathways within bacteria and yeast as revealed by global protein network alignment.
PNAS 2003, 100(20):1139411399. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tetali P: Random walks and the effective resistance of networks.
Journal of Theoretical Probability 1991, 4:101109. Publisher Full Text

Chandra AK, Raghavan P, Ruzzo WL, Smolensky R: The electrical resistance of a graph captures its commute and cover times. In In STOC '89: Proceedings of the twentyfirst annual ACM symposium on Theory of computing. New York, NY, USA: ACM; 1989:574586. Publisher Full Text

Fouss F, Pirotte A, Saerens M: A Novel Way of Computing Similarities between Nodes of a Graph, with Application to Collaborative Recommendation. In In WI '05: Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence. Washington, DC, USA: IEEE Computer Society; 2005:550556. Publisher Full Text

Tong H, Faloutsos C, Pan JY: Random walk with restart: fast solutions and applications.
Knowl Inf Syst 2008, 14(3):327346. Publisher Full Text

Lord P, Stevens R, Brass A, Goble C: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation.
Bioinformatics 2003, 19:12751283. PubMed Abstract  Publisher Full Text

Schlicker A, Huthmacher C, Ramírez F, Lengauer T, Albrecht M: Functional evaluation of domaindomain interactions and human protein interaction networks.
Bioinformatics 2007, 23:859865. PubMed Abstract  Publisher Full Text

Chagoyen M, Carazo JM, PascualMontano A: Assessment of protein set coherence using functional annotations.
BMC Bioinformatics 2008, 9:444. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Resnik P: Using Information Content to Evaluate Semantic Similarity in a Taxonomy. [http://citeseer.ist.psu.edu/resnik95using.html] webcite

Jiang JJ, Conrath DW: Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy.

Sevilla J, Segura V, Podhorski A, Guruceaga E, Mato J, MartínezCruz L, Corrales F, Rubio A: Correlation between gene expression and GO semantic similarity.
IEEE/ACM Trans Comput Biol Bioinform 2005, 2:330338. PubMed Abstract  Publisher Full Text

Pu S, Vlasblom J, Emili A, Greenblatt J, Wodak SJ: Identifying functional modules in the physical interactome of Saccharomyces cerevisiae.
Proteomics 2007, 7:944960. PubMed Abstract  Publisher Full Text

Pesquita C, Faria D, Bastos H, Ferreira AE, Falcão AO, Couto FM: Metrics for GO based protein semantic similarity: a systematic evaluation.
BMC Bioinformatics 2008, 9(Suppl 5):S4. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Gavin AC, SupertiFurga G: Protein complexes and proteome organization from yeast to man.
Curr Opin Chem Biol 2003, 7:2127. PubMed Abstract  Publisher Full Text

Schlicker A, Rahnenfuhrer J, Albrecht M, Lengauer T, Domingues F: GOTax: investigating biological processes and biochemical activities along the taxonomic tree. [http://genomebiology.com/2007/8/3/R33] webcite
Genome Biology 2007, 8(3):R33. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Simpson A, Fitter M: What is the best index of detectability?
Psychological Bulletin 1973, 80(6):481488. Publisher Full Text

Edwards AM, Kus B, Jansen R, Greenbaum D, Greenblatt J, Gerstein M: Bridging structural biology and genomics: assessing protein interaction data with known complexes.
Trends Genet 2002, 18:529536. PubMed Abstract  Publisher Full Text

Breitkreutz B, Stark C, Reguly T, Boucher L, Breitkreutz A, Livstone M, Oughtred R, Lackner D, Bähler J, Wood V, et al.: The BioGRID Interaction Database: 2008 update.
Nucleic Acids Res 2007. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kersey P, Bower L, Morris L, Horne A, Petryszak R, Kanz C, Kanapin A, Das U, Michoud K, Phan I, et al.: Integr8 and Genome Reviews: integrated views of complete genomes and proteomes.
Nucleic Acids Res 2005, 33:297302. Publisher Full Text

Raghavachari B, Tasneem A, Przytycka T, Jothi R: DOMINE: a database of protein domain interactions.
Nucleic Acids Res 2007. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Browne F, Wang H, Zheng H, Azuaje F: GRIP: A webbased system for constructing Gold Standard datasets for proteinprotein interaction prediction.
Source Code Biol Med 2009, 4:2. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Güldener U, Müunsterkötter M, Kastenmüller G, Strack N, van Helden J, Lemer C, Richelles J, Wodak SJ, GarcíaMartínez J, PérezOrtín JE, et al.: CYGD: the Comprehensive Yeast Genome Database.
Nucleic Acids Res 2005, 33:D364368. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Barrell D, Dimmer E, Huntley RP, Binns D, O'Donovan C, Apweiler R: The GOA database in 2009an integrated Gene Ontology Annotation resource.
Nucleic Acids Res 2009, 37:396403. Publisher Full Text