Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, PR China

Department of Physics, Pennsylvania State University, University Park, PA 16802, USA

National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences, Beijing 100190, PR China

Abstract

Background

Protein domains are the functional and structural units of proteins, and a large proportion of protein-protein interactions (PPIs) are achieved through domain-domain interactions (DDIs). Since high-throughput technologies have produced large numbers of PPIs for different species, many computational efforts have been made to identify DDIs from experimental PPIs. These methods fall into two categories: deterministic and probabilistic. Deterministic methods rely on the parsimony assumption. The parsimony principle has been widely used in computational biology because evolution in nature can be viewed as a continuous optimization process. In the context of identifying DDIs, parsimony methods seek a minimal set of DDIs that can explain the observed PPIs. This category of methods is promising since such models can be formulated and solved easily; moreover, studies have shown that they can detect specific DDIs, which is often hard for probabilistic methods. We notice that existing methods view PPI networks simply as assemblies of single interactions, whereas there is now ample evidence that PPI networks should be considered from a global (systematic) point of view, since they exhibit general properties of complex networks, such as being 'scale-free' and 'small-world'.

Results

In this work, we integrate this global point of view into the parsimony-based model. Specifically, prior knowledge is extracted from these global properties by plausible reasoning and then taken as input. We investigate the role of the added information extensively through numerical experiments. The results show that the proposed method has improved performance, which confirms the biological meaning of the extracted prior knowledge.

Conclusions

This work provides clues for using these properties of complex networks in computational models and, to some extent, reveals the biological meaning underlying these general network properties.

Background

Recently, researchers have confirmed that most proteins perform their functions by physically binding to other proteins, permanently or transiently. These interactions can be represented as a protein-protein interaction (PPI) network, with each node corresponding to a protein and each edge to an interaction. The development of high-throughput technologies, such as yeast two-hybrid screening, has produced a large number of PPIs for different species.

In general, proteins consist of one or more structural domains. A PPI is usually carried out through domain-domain interactions (DDIs). While PPIs are not well conserved among species, the recognition patterns of domain pairs are often shared across organisms

From a computational perspective, these methods fall into two categories. Methods in the first category try to find pairs of domains that co-occur significantly more often in interacting protein pairs than in non-interacting pairs; the association method is a representative example.

The second category, different from the probabilistic framework, models the issue as a combinatorial optimization problem. The idea is that an observed PPI must be explained by at least one pair of interacting domains involved; these methods then try to explain the observed interacting protein pairs using a minimal number of domain pairs (the minimal spanning set), namely the parsimony-based approaches

Although the problem has been thoroughly studied in recent years, we realize that existing models only make use of the local information of PPI networks (assembled single interactions). There is now ample evidence that PPI networks should be considered from a global (systematic) point of view, since they exhibit general properties of complex networks. 'Complex networks' is an emerging concept that unifies networks appearing in different disciplines, such as social networks, information networks, and biological networks

Besides, although the parsimony principle is widely used in computational biology, little work has been done to verify its rationality quantitatively. Here, we investigate the parsimonious nature of the organization of DDIs in mediating PPIs through randomization-based tests, which justifies the parsimony assumption from a computational perspective.

Methods

Parsimony based methods

Zhang et al. proposed a parsimony-based integer linear programming (ILP) model.

We denote the observed protein-protein interaction network as G = (V, E), where V = {P_1, P_2, ..., P_N} is the set of proteins. Each protein P_m contains a set of domains, and a domain pair (D_i, D_j) is a candidate for mediating an observed interaction (P_m, P_n) if D_i belongs to P_m and D_j belongs to P_n.

Here, we use (P_m, P_n) to denote an observed interaction between proteins P_m and P_n, and x_{ij} to denote the binary variable that equals 1 if the domain pair (D_i, D_j) interacts and 0 otherwise.

Guimaraes et al. proposed a model based on the same parsimony idea.

They modeled the noise in the protein-protein interaction data by selecting the constraints randomly according to a reliability probability, and scored each domain pair (D_i, D_j) with an additional term p^{w(i, j)}, where w(i, j) is the number of witnesses of the pair and p^{w(i, j)} denotes the probability that all PPIs corresponding to the witnesses are false positives. This term is useful for removing promiscuous domain-domain interactions that are scored high only because of their appearance frequency.

The aforementioned methods share a common computational assumption, namely, the parsimony principle. In fact, the parsimony principle has been widely used in computational biology due to its biological/evolutionary implications and intuitive simplicity. For example, the parsimony strategy has been used in haplotype inference

The parsimonious nature of PPIs

To verify the parsimony assumption in the context of predicting DDIs, we design two randomization tests. The parsimony principle here is to use a minimal number of DDIs to explain the observed PPIs. We define a null model in which there is no evolutionary optimization process in organizing protein domain composition and protein-protein interactions, and compute the minimal number of such DDIs through (Eq. 5-7). To achieve this, the original data set is shuffled randomly. To simplify the argument, we define a random variable X as the minimal number of DDIs computed from a shuffled data set, while X_0 is the corresponding value computed from the original data. Under the null model, we would expect X to be significantly larger than X_0. Particularly, the original data set is shuffled with two different rules. The first rule shuffles the protein domain composition while the PPIs are conserved (for each protein, the number of constituent domains is conserved); conversely, the second rule shuffles the PPIs while maintaining the composition (the degree distribution of the PPI network is conserved). For the PPIs of S. cerevisiae used in this work, X_0 = 12663 on this data set. In both cases, the distribution of X lies significantly above X_0, which supports the parsimony assumption.
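The first shuffling rule can be sketched as follows. For brevity, a greedy set-cover heuristic stands in for the exact computation of Eq. 5-7 when estimating the minimal number of explanatory DDIs, and the data structures (a dict mapping each protein to its domain list, plus a PPI edge list) are hypothetical:

```python
import random

def min_ddis_greedy(ppis, domains):
    """Greedily pick domain pairs until every PPI is 'explained' by at
    least one pair (a heuristic stand-in for the ILP of Eq. 5-7).
    Assumes every protein has at least one domain."""
    uncovered = set(ppis)
    chosen = set()
    while uncovered:
        # count how many uncovered PPIs each candidate domain pair could explain
        counts = {}
        for pm, pn in uncovered:
            for di in domains[pm]:
                for dj in domains[pn]:
                    pair = tuple(sorted((di, dj)))
                    counts[pair] = counts.get(pair, 0) + 1
        best = max(counts, key=counts.get)
        chosen.add(best)
        uncovered = {(pm, pn) for pm, pn in uncovered
                     if best not in {tuple(sorted((di, dj)))
                                     for di in domains[pm] for dj in domains[pn]}}
    return len(chosen)

def shuffle_compositions(domains, rng=random):
    """Null model rule 1: reassign domains to proteins at random while
    preserving each protein's domain count (PPIs stay untouched)."""
    pool = [d for ds in domains.values() for d in ds]
    rng.shuffle(pool)
    shuffled, k = {}, 0
    for p, ds in domains.items():
        shuffled[p] = pool[k:k + len(ds)]
        k += len(ds)
    return shuffled
```

Repeating `min_ddis_greedy(ppis, shuffle_compositions(domains))` over many shuffles yields an empirical distribution of the statistic to compare against the value from the original data.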

PPIs and protein domain compositions are parsimoniously organized in nature

**PPIs and protein domain compositions are parsimoniously organized in nature**. Under each null model, 200 data sets are simulated. The distributions of the minimal number of DDIs are shown and compared with the value computed from the original data.

Motivation

Considering that it is intractable to directly integrate the 'small-world' or 'scale-free' properties into the model, as both are statistical descriptions, we instead consider the clustering coefficient

For vertices with degree 0 or 1, both the numerator and the denominator are zero, so we define C_i = 0 for such vertices.
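A minimal sketch of the computation, using the standard local clustering coefficient C_i = 2e_i / (k_i(k_i - 1)) (with e_i the number of edges among the k_i neighbors of vertex i, consistent with the convention above) and its average as the global coefficient:

```python
def local_clustering(adj, i):
    """C_i = 2 * (# edges among neighbors of i) / (k_i * (k_i - 1));
    defined as 0 when the degree k_i is 0 or 1."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # count edges among the neighbors of i (each pair once)
    links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
    return 2.0 * links / (k * (k - 1))

def global_clustering(adj):
    """Average of the local coefficients over all vertices."""
    return sum(local_clustering(adj, i) for i in adj) / len(adj)
```

Here `adj` is a dict mapping each vertex to the set of its neighbors; for a triangle with one pendant vertex attached, the global coefficient is 7/12.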

In terms of social networks, a large clustering coefficient implies that the friend of your friend is likely also to be your friend. In many real complex networks, the clustering coefficient tends to a non-zero constant as the network grows, while in comparable random networks it tends to zero.

In the definition above, nodes with small degree contribute larger values to the global clustering coefficient because they have smaller denominators (Eq. 8), so we can deduce that the existence of triangle structures connected to poor nodes (nodes with few neighbors) plays a crucial role in maintaining a relatively large global clustering coefficient.

We can also view this biologically. It is known that most proteins carry out their functions through physically binding to other proteins, rather than acting individually. A protein with few neighbors is therefore more likely to form a tight complex with its neighbors, that is to say, its neighbors tend to interact with each other. On the other hand, rich nodes (hubs) are more likely to execute multiple functions under different cell types/conditions, and experimentally detected interactions associated with a rich node are the union of these cell-type/condition-specific interactions, so we cannot deduce any interaction potential among the proteins connected to a rich node.

Among experimental PPIs, a large proportion are false positives, which hinders many computational models. As discussed above, from a network view and biological intuition, we reason that detected interactions centering on a poor node are more likely to be true positives.

Weighted integer linear programming model

Based on the discussion above, we give preferences to observed PPIs: interactions between proteins sharing a poor neighbor have priority in being explained by DDIs, and for such interactions smaller weights are given to the domain pairs involved. The mathematical description is as follows. Suppose d_min and d_max are the minimal and maximal degrees of the PPI network. Proteins are partitioned by degree into K groups S_1, ..., S_K, where S_1 contains the proteins with the smallest degrees. For a protein in S_1 and an interaction centering on it, smaller weights are given to the domain pairs involved in the interaction. We define the set of favored domain pairs as W = {(D_i, D_j) | D_i ∈ P_m, D_j ∈ P_n, (P_m, P_n) is an observed interaction, and P_m and P_n share a neighbor P ∈ S_1}.

If (D_i, D_j) ∈ W, that is, it may mediate an interaction (P_m, P_n) whose two proteins share a poor neighbor, the weight w_{ij} is set to a value smaller than 1; otherwise, w_{ij} = 1.
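As an illustration of the weighting rule, the sketch below bins proteins by degree, marks the lowest bin as the 'poor' set S_1, and down-weights domain pairs that could mediate an interaction whose two proteins share a poor neighbor. The bin count `K` and the reduced weight `low_weight` are illustrative choices, not values taken from the paper:

```python
def ddi_weights(ppis, domains, adj, K=5, low_weight=0.5):
    """Assign weight w_ij < 1 to domain pairs that can mediate an
    interaction whose two proteins share a low-degree ('poor') neighbor;
    all other candidate pairs keep weight 1. K and low_weight are
    hypothetical parameters for illustration."""
    degs = {p: len(adj[p]) for p in adj}
    dmin, dmax = min(degs.values()), max(degs.values())
    width = (dmax - dmin) / K or 1
    s1 = {p for p, d in degs.items() if d <= dmin + width}  # poorest bin S_1
    favored = set()
    for pm, pn in ppis:
        if adj[pm] & adj[pn] & s1:  # endpoints share a poor neighbor
            for di in domains[pm]:
                for dj in domains[pn]:
                    favored.add(tuple(sorted((di, dj))))
    all_pairs = {tuple(sorted((di, dj)))
                 for pm, pn in ppis for di in domains[pm] for dj in domains[pn]}
    return {pair: (low_weight if pair in favored else 1.0) for pair in all_pairs}
```

The returned dict supplies the objective coefficients w_{ij} of the weighted model.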

Then, we get a weighted integer linear programming model (WILP):

This model is referred to as the WILP (weighted integer linear programming) model hereafter. In practical computation, the integer linear program is relaxed to a linear program by allowing the variables x_{ij} to take fractional values in [0, 1].
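To make the optimization concrete, the sketch below solves a toy instance of the weighted explanation problem exactly by enumeration: minimize the total weight of the selected domain pairs subject to every observed PPI being explained by at least one selected pair. Brute force stands in for the ILP/LP solver used in the paper and is only viable for tiny, hypothetical instances:

```python
from itertools import product

def wilp_exact(ppis, domains, weights):
    """Minimize sum of w_ij * x_ij subject to: every observed PPI (P_m, P_n)
    is covered by at least one chosen domain pair (D_i, D_j) with D_i in P_m
    and D_j in P_n. Solved by enumerating all 0/1 assignments (toy sizes only);
    unknown pairs default to weight 1."""
    pairs = sorted({tuple(sorted((di, dj)))
                    for pm, pn in ppis for di in domains[pm] for dj in domains[pn]})
    best, best_cost = None, float('inf')
    for x in product((0, 1), repeat=len(pairs)):
        chosen = {p for p, xi in zip(pairs, x) if xi}
        # feasibility: each PPI must be explained by a chosen domain pair
        ok = all(any(tuple(sorted((di, dj))) in chosen
                     for di in domains[pm] for dj in domains[pn])
                 for pm, pn in ppis)
        cost = sum(weights.get(p, 1.0) for p in chosen)
        if ok and cost < best_cost:
            best, best_cost = chosen, cost
    return best, best_cost
```

Lowering the weights of particular pairs steers the optimum toward them, which is exactly how the network-derived prior acts in the weighted model.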

Results and discussion

Data sets

PPIs of S. cerevisiae, together with the domain composition of each protein, are used as the data set in this work.

The clustering coefficient of the PPI network

The clustering coefficient of the PPI network we used is 0.0970. To make this value comparable, two network generation models are employed as null models: a scale-free model and the random graph model G_{n, m}
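As a sketch of the G_{n, m} null model, the snippet below samples a graph with a prescribed number of nodes and edges and returns its average local clustering coefficient; for sparse random graphs this value is close to the edge density 2m / (n(n - 1)). The scale-free null model is omitted for brevity, and the parameters shown are illustrative, not those of the actual PPI network:

```python
import random
from itertools import combinations

def gnm_clustering(n, m, seed=0):
    """Sample an Erdos-Renyi G_{n,m} graph (n nodes, m edges chosen
    uniformly at random) and return its average local clustering coefficient."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for u, v in rng.sample(list(combinations(range(n), 2)), m):
        adj[u].add(v)
        adj[v].add(u)

    def local(i):
        k = len(adj[i])
        if k < 2:  # convention: C_i = 0 for degree 0 or 1
            return 0.0
        links = sum(1 for u, v in combinations(sorted(adj[i]), 2) if v in adj[u])
        return 2.0 * links / (k * (k - 1))

    return sum(local(i) for i in adj) / n
```

Repeating this for many seeds gives the null distribution summarized in the boxplots.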

S.cerevisiae's PPI network shows a relatively larger clustering coefficient

**S.cerevisiae's PPI network shows a relatively larger clustering coefficient**. To make the observed clustering coefficient of the PPI network (0.0970) comparable, two network generation procedures are employed as null models. The clustering coefficients of the null models are shown as boxplots.

Predicted DDIs are differently enriched in the golden data set

We first evaluate the performance difference between the modified model and the original one by counting the number of predicted domain pairs confirmed by the golden data set. The linear programming problem after relaxation has 30394 variables and 20709 constraints, but only 756 variables (DDIs) appear in the golden data set, owing to the difficulty of detecting DDIs experimentally. We therefore face a lack of 'positives', and the apparent rate of false positives may be inflated. But considering that our main purpose here is to investigate the role of the weights, we still expect to see a difference.

Specifically, 'sensitivity' and 'fold change' defined below are used to evaluate the performances of the models.
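The exact formulas are not reproduced in this excerpt, but the definitions below match the tabulated values (e.g., 382 true positives among 12663 predictions gives sensitivity 50.53% and fold change 1.21), assuming sensitivity is the fraction of the 756 golden DDIs recovered and fold change is the precision relative to the background rate 756/30394 of golden pairs among all candidate pairs; these formulas are inferred from the table entries:

```python
def sensitivity(tp, n_golden=756):
    """Percentage of golden-standard DDIs recovered among the predictions."""
    return 100.0 * tp / n_golden

def fold_change(tp, n_predicted, n_golden=756, n_candidates=30394):
    """Precision among predictions, relative to the background rate of
    golden DDIs among all candidate domain pairs."""
    return (tp / n_predicted) / (n_golden / n_candidates)
```

For instance, the last table row (131 predictions, 29 true positives) yields a fold change of about 8.90, in agreement with the reported value.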

The results of WILP model and the ILP model are shown in Figure

WILP outperforms ILP in terms of the number of the predicted DDIs confirmed by the golden data set

**WILP outperforms ILP in terms of the number of the predicted DDIs confirmed by the golden data set**. (A) Sensitivities of WILP and ILP are compared as

Performance comparison between WILP and ILP

| **sd** | **Total Predictions** | **True Positives** | **Sensitivity(%)** | **Fold Change** |
| --- | --- | --- | --- | --- |
| 1 | 12663 (12663) | 382 (375) | 50.53 (49.60) | 1.21 (1.19) |
| 0.9 | 10592 (10592) | 361 (351) | 47.75 (46.43) | 1.37 (1.33) |
| 0.8 | 8521 (8521) | 341 (342) | 45.11 (45.24) | 1.61 (1.61) |
| 0.7 | 6450 (7102) | 306 (306) | 40.48 (40.48) | 1.91 (1.73) |
| 0.6 | 4379 (5162) | 276 (223) | 36.51 (29.50) | 2.53 (1.74) |
| 0.5 | 2648 (3091) | 190 (176) | 25.13 (23.28) | 2.88 (2.29) |
| 0.4 | 1613 (1620) | 145 (143) | 19.18 (18.92) | 3.61 (3.55) |
| 0.3 | 875 (779) | 104 (89) | 13.76 (11.77) | 4.78 (4.59) |
| 0.2 | 430 (279) | 69 (37) | 9.13 (4.89) | 6.45 (5.33) |
| 0.1 | 131 (63) | 29 (16) | 3.84 (2.12) | 8.90 (10.21) |

Comparison of WILP and ILP in terms of the number of predicted DDIs confirmed by the golden data set. Predicted DDIs verified by the golden data set are counted as true positives. 'Sensitivity' and 'Fold Change' are defined in the main text. For each entry, the WILP value is given first with the corresponding ILP value in parentheses. Numbers marked in red indicate that WILP outperforms ILP

There is a parameter controlling the size of S_1. According to the preceding reasoning, a smaller S_1 makes the extracted prior information more precise but less abundant. In the numerical experiments, a broad range of parameter values is tested.

Statistical significance of the weights

The performance difference between WILP and ILP has been shown above. In this section, we confirm that the observed accuracy improvement is not obtained by chance; that is to say, the weights derived from network properties are indeed meaningful. Specifically, random weights are given to WILP (the null model), and the distribution of TP is estimated and compared with the real values (Table

Statistical significance of the weights

**Statistical significance of the weights**. Random weights are given to WILP and the distributions of 'TP' are shown as 'violin plots'.

Functional similarity analysis of predicted DDIs

WILP outperforms ILP in terms of the number of predicted DDIs confirmed by the golden data set. In this section, the two models are compared from a functional point of view. In gene expression analysis, co-expressed genes are deemed functionally similar because they may be involved in the same biological process. It is natural to hypothesize that physically interacting domains have similar biological functions. This motivates us to compare WILP and ILP by examining the functional similarity of the predicted DDIs. GO terms have been mapped to Pfam entries

Similarity analysis of the predicted DDIs

**Similarity analysis of the predicted DDIs**. Comparison of functional similarities of the predicted DDIs obtained by ILP and WILP (

Conclusions

Knowledge about domain-domain recognition patterns provides insight into the organization of PPIs and protein function. While DDIs are difficult to determine experimentally, many computational approaches have been proposed to discover such patterns from PPIs, among which parsimony-based models stand out for their easy implementation and their power in detecting specific DDIs. We notice that existing methods only make use of PPIs in a local way. As PPI networks are an important class of complex networks and exhibit global properties such as 'small-world', 'scale-free', and a relatively large clustering coefficient, in this paper we integrate the clustering coefficient feature as prior knowledge into the computational model.

The results show that WILP outperforms ILP to some extent, which confirms that those properties are biologically meaningful. This may shed light on a new perspective for studying DDI and PPI networks. Currently, studies of complex networks mainly focus on these common features, but little work has been done to investigate what lies behind them. We point out that those features can be connected with a specific problem in computational biology; the role of the features can then be studied in a context-dependent way, where plenty of tools have been developed.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

XSZ and RSW designed the study. CC, JFZ and QH implemented the method, performed the experiments and analyzed the data. All authors contributed to discussions on the method. CC and XSZ wrote the manuscript. All authors revised the manuscript and approved the final version.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. 60873205).

This article has been published as part of