Tongji University, 1239 Siping Road, Shanghai, P.R. China

Department of Information, Electronic Engineering Institute, Hefei, Anhui 230027, P.R. China

College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, P.R. China

Department of Automation, University of Science and Technology of China, Hefei, Anhui 230027, P.R. China

Abstract

Background

Protein-protein interactions (PPIs) play crucial roles in virtually every aspect of cellular function within an organism. Over the last decade, the development of novel high-throughput techniques has resulted in enormous amounts of data and provided valuable resources for studying protein interactions. However, these high-throughput protein interaction data are often associated with high false positive and false negative rates. It is therefore highly desirable to develop scalable methods to identify these errors from the computational perspective.

Results

We have developed a robust computational technique for assessing the reliability of interactions and predicting new interactions by combining manifold embedding with multiple information integration. The proposed method was validated with extensive experiments on both densely-connected and sparse PPI networks of yeast. Results demonstrate that the interactions ranked top by our method have high functional homogeneity and localization coherence.

Conclusions

Our proposed method outperforms the existing methods in both assessing and predicting protein interactions. Furthermore, it is general enough to work on a variety of PPI networks, whether densely-connected or sparse. The proposed algorithm is therefore a promising method for detecting both false positive and false negative interactions in PPI networks.

Background

Protein-protein interactions (PPIs) play a critical role in most cellular processes and form the basis of biological mechanisms. Over the last decade, the development of novel high-throughput techniques, such as yeast-two-hybrid (Y2H), tandem affinity purification (TAP), and mass spectrometry (MS), has resulted in a rapid accumulation of data that provide a global description of the whole network of PPI for many organisms

In the past several years, many computational techniques have been proposed to assess and predict protein interactions. Among them, the network-topology-based methods have attracted extensive attention due to geometric intuition and computational feasibility. The main idea of these approaches is to rank the reliability of an interacting protein pair solely based on the topology of the interactions between the protein pair and their neighbors within a short radius

These network-topology-based methods do yield impressive results on some benchmark data sets. However, there are two main shortcomings of using indices such as IG, IRAP, CD-Dist, and FSWeight for assessing and predicting protein interactions. On one hand, most of these methods are based on a single source of biological evidence, which makes it hard for them to achieve both high specificity and good sensitivity at the same time

On the other hand, their performance will deteriorate rapidly when the network-topology-based methods are applied to sparse PPI networks

It is well known that proteins generate interactions with each other based on their biochemical and structural properties

Results

In this section, we first quantify the success of embedding a PPI network into a low-dimensional metric space using probability density functions and Receiver Operator Characteristic (ROC) curves, both of which are learned from the data given by manifold embedding. The performance of the proposed approach is then evaluated using the functional homogeneity and localization coherence of protein interactions from four PPI networks that are derived from various scales and high-throughput techniques, i.e., yeast-two-hybrid (Y2H), tandem affinity purification (TAP), and mass spectrometry (MS).

The distribution of pairwise distance in embedding space for interactions and non-interactions

In order to quantify the success of embedding PPI network into low dimensional metric space, we learn the following two conditional probability density functions based on original PPI network and its embedding space:

In Figures

**The conditional probability density functions p(Distance|Interaction) and p(Distance|Non-interaction) learned from embedding the components of the following four PPI networks into 20-dimensional Euclidean space: (a) Krogan, (b) DIP, (c) Tong, (d) DIP+BioGRID**.

Receiver operator characteristic curves for embedding PPI network into metric space

To measure the ability of the proposed manifold embedding method to recover original PPI network, we use a standard ROC curve analysis. Figure

**The ROC curves measuring the ability of recovering the original PPI networks (a) Krogan, (b) DIP, (c) Tong, and (d) DIP+BioGRID using embedding space dimensions of 2 to 20**.

where TP (True Positive) is the number of true interacting protein pairs which are predicted to be interacting (the distance between point pair in the embedding space is less than a given distance threshold). TN (True Negative) is the number of non-interacting protein pairs that are predicted to be non-interacting (the distance between point pair in embedded space is larger than a given distance threshold). FP (False Positive) is the number of non-interacting protein pairs which are predicted to be interacting, and FN (False Negative) is the number of interacting protein pairs which are predicted to be non-interacting.
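The thresholding rule described above can be illustrated with a minimal sketch. It assumes the standard definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the function names and the toy distances are hypothetical and are not taken from the paper.

```python
def confusion_at_threshold(distances, labels, threshold):
    """Count TP, FP, TN, FN when a pair is predicted to interact
    if its embedding-space distance is below `threshold`.
    Pairs exactly at the threshold are treated as non-interacting (an assumption)."""
    tp = fp = tn = fn = 0
    for dist, interacting in zip(distances, labels):
        predicted = dist < threshold
        if predicted and interacting:
            tp += 1
        elif predicted and not interacting:
            fp += 1
        elif not predicted and interacting:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(distances, labels, threshold):
    """Standard definitions: TP / (TP + FN) and TN / (TN + FP)."""
    tp, fp, tn, fn = confusion_at_threshold(distances, labels, threshold)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


def roc_points(distances, labels):
    """One (false positive rate, true positive rate) point per candidate threshold."""
    thresholds = sorted(set(distances)) + [float("inf")]
    points = []
    for t in thresholds:
        sens, spec = sensitivity_specificity(distances, labels, t)
        points.append((1.0 - spec, sens))
    return points


# Toy data (hypothetical): distances between embedded protein pairs and
# whether each pair interacts in the original network.
toy_distances = [0.2, 0.4, 0.5, 0.9, 1.3, 1.8]
toy_labels = [True, True, False, True, False, False]
print(sensitivity_specificity(toy_distances, toy_labels, 1.0))
```

Sweeping the threshold from the smallest distance to infinity traces the curve from (0, 0) to (1, 1), which is the trade-off the ROC analysis measures.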

It is well known that a ROC curve depicts relative trade-offs between true positive (benefits) and false positive (costs). The best possible ROC curve would contain a point in the upper left corner or coordinate (0, 1) of the ROC space, representing 100% sensitivity (no false negatives) and 100% specificity (no false positives). From Figure

As can be seen from Figure

Assessing the reliability of protein interactions

To validate the proposed method for assessing the reliability of protein interactions in the case of embedding into the 20-dimensional space, we systematically compare it with IG

For our proposed method, there are two parameters:

**The choice of the parameter ε**.

The strategy of 'guilt by association' provides the evidence that interacting proteins are likely to share a common function and cellular localization

In this study, Gene Ontology (GO) based annotations are used to evaluate the functional homogeneity and localization coherence. The GO is one of the most important ontologies within the bioinformatics community

Experiment using the densely-connected PPI network

The experiment is conducted on the DIP+BioGRID dataset. This densely-connected protein interaction network comprises 72794 non-redundant interactions between 5613 yeast proteins. Among these 5613 proteins, 5179 have functional annotations. The number of interactions whose two proteins share a common function annotation is 56639. Therefore, the proportion of interactions with functional homogeneity is 77.8%. In addition, 5165 proteins in this yeast dataset have cellular component annotations. The number of interactions whose two proteins share a common localization annotation is 49469. Therefore, the proportion of interactions with localization coherence in the dataset is 67.9%.
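As a small illustration of how such a proportion can be computed, the sketch below counts the interactions whose two proteins share at least one annotation term and divides by the total number of interactions. The function name, the protein names, and the term labels are hypothetical; treating an unannotated protein as sharing nothing is an assumption.

```python
def shared_annotation_rate(interactions, annotations):
    """Fraction of interactions whose two proteins share at least one term.
    A protein without annotations is treated as sharing nothing (assumption)."""
    if not interactions:
        return 0.0
    shared = 0
    for u, v in interactions:
        terms_u = annotations.get(u, set())
        terms_v = annotations.get(v, set())
        if terms_u & terms_v:
            shared += 1
    return shared / len(interactions)


# Toy network with hypothetical proteins and term labels.
toy_interactions = [("A", "B"), ("A", "C"), ("B", "D")]
toy_annotations = {"A": {"t1", "t2"}, "B": {"t2"}, "C": {"t3"}}
print(shared_annotation_rate(toy_interactions, toy_annotations))

# The functional homogeneity reported in the text is the same kind of ratio.
print(round(100 * 56639 / 72794, 1))
```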

We rank interactions according to their RI values from the lowest to highest, and measure the functional homogeneity and localization coherence by computing the rate of interacting protein pairs with common functional roles and cellular localization. Figure

**Comparison of our method, IG, FSWeight, and CD-Dist for assessing the reliability of interactions in terms of functional homogeneity and localization coherence**. The comparison is performed by using data on 72794 interactions from the BioGRID database (version 2.0.52)

Experiment using the sparse PPI networks

As is well known, the real PPI networks are typically very sparse, with average degree of 7 or less

We rank interactions according to their RI values in the same manner as in the previous section, and measure the functional homogeneity and localization coherence by computing the rate of interacting protein pairs with common functional roles and cellular localization. The experimental results on the three datasets Krogan, DIP, and Tong are shown in Figures

Comparison of our method, IG, FSWeight, and CD-Dist for assessing the reliability of interactions in terms of functional homogeneity and localization coherence

In Figure

Since the IG, FSWeight, and CD-Dist methods are defined purely on the basis of the topology of the neighbors of the protein pairs and their formulation implicitly requires the protein pairs being considered to have a sufficient number of partners

Predicting new protein interactions

In this section, we evaluate our proposed method for predicting new protein interactions, using the same data sets as in assessing the reliability of protein interactions. Because the IG method tends to assume interactions between protein pairs to be true, IG may overestimate the reliability of the missing links during the false negative detection process

We inspect whether the top new interactions predicted by our method exhibit a higher degree of functional homogeneity and localization coherence than those predicted using the other two approaches. The results are illustrated in Figures

**Comparison of our method, FSWeight, and CD-Dist for predicting new interactions in terms of functional homogeneity and localization coherence**. The comparison is performed by using data on 72794 interactions from the BioGRID database (version 2.0.52)

Comparison of our method, FSWeight, and CD-Dist for predicting new interactions in terms of functional homogeneity and localization coherence

The results in Figure

Discussion

It is worthwhile to highlight several aspects of the proposed approach here:

(1) We present a novel network-topology-based approach with information fusion for assessing and predicting protein interactions. It effectively avoids the false positives and false negatives from "single evidence models" such as IG, CD-Dist, and FSWeight.

(2) The purpose of low-dimensional manifold modeling is to represent each node of a PPI network (graph) as a low-dimensional vector that preserves the similarities between node pairs, where similarity is measured by a PPI network (graph) similarity matrix that characterizes certain geometric properties of the data set. Manifold modeling makes our proposed approach general enough to work on a variety of PPI networks, whether densely-connected or sparse.

(3) In order to make the ISOMAP algorithm suitable for PPI network datasets, we present a fast ISOMAP algorithm based on minimum set cover (MSC). The success at detecting both new and spurious interactions confirms that the proposed algorithm is able to uncover the intrinsic structural features of PPI network. To our knowledge, this paper is one of the first studies aiming at utilizing manifold learning theory to assess and predict protein interactions.

Conclusions

In this paper, we have developed a robust technique to assess and predict protein interactions from high-throughput experimental data by combining manifold embedding with multiple biological data integration. The proposed approach first uses logistic regression to integrate multiple genomic and proteomic data sources. After obtaining a weighted PPI network, we apply the fast-ISOMAP algorithm, which is based on manifold learning theory, to transform the weighted PPI network into a low-dimensional metric space. A reliability index, which indicates the likelihood that two proteins interact, is then assigned to each protein pair in the PPI network on the basis of the similarity between the points in the embedded space. To the best of our knowledge, this is one of the first studies on assessing and predicting protein interactions that explicitly considers low-dimensional manifold modeling and uses manifold learning theory to embed a PPI network into a low-dimensional metric space. The experimental results show that our method consistently performs better than the existing network-topology-based methods on both densely-connected and sparse PPI networks, which indicates that the proposed approach is independent of the sparseness of the PPI network and might shed more light on assessing and predicting protein interactions.

Although our experimental results on the four protein interaction data sets demonstrate that our proposed method is insensitive to the dimensionality of the embedding space, the intrinsic dimensionality of the data manifold, or its degrees of freedom, helps to capture the inherent attributes hidden in the high-dimensional unorganized observation space. Therefore, how to estimate the intrinsic dimensionality of the PPI dataset is a problem deserving further investigation. In addition, the ISOMAP algorithm requires that the analyzed manifold be a convex subset of $\mathbb{R}^{D}$

Methods

Data sets under study

There are three different types of data used in this paper: 1) gold standard data sets of known interactions (true positives, TPs) and non-interacting protein pairs (true negatives, TNs); 2) genomic and proteomic feature data sets; and 3) protein interaction data sets.

Gold standard data sets

To measure the reliability of the evidence coming from each information source, gold standard positive (GSP) and gold standard negative (GSN) data sets must be constructed. These contain a sufficiently large number of TPs (proteins known to be interacting) and TNs (proteins known to be non-interacting), respectively. Based on the assumption that proteins belonging to the same complex are likely to interact with each other, information on protein complex membership specified in the MIPS complex catalog is often used to construct the GSP data sets. Unlike positive interactions, confirmed reports of non-interacting pairs are rare. Considering the small fraction of interacting pairs in the total set of potential protein pairs, we use a random set of protein pairs, excluding those interacting pairs that are known, as the GSN data set. In this paper, we collected 12,279 GSP and 19,641,036 GSN protein pairs to train the logistic regression model and compute the model parameters.
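The reported GSN size is consistent with taking every unordered pair among 6,270 proteins and excluding the 12,279 GSP pairs. The sketch below checks that arithmetic and shows one way to enumerate candidate negative pairs on a toy example. Using 6,270 as the protein count is an assumption (it is the protein count listed for several features in the feature table), and the helper name is hypothetical.

```python
from itertools import combinations
from math import comb

N_PROTEINS = 6270  # assumption: protein count taken from the feature table
N_GSP = 12279      # gold standard positive pairs reported in the text


def candidate_negative_pairs(proteins, positive_pairs):
    """All unordered protein pairs that are not known positives."""
    positives = {frozenset(pair) for pair in positive_pairs}
    return [
        pair
        for pair in combinations(sorted(proteins), 2)
        if frozenset(pair) not in positives
    ]


total_pairs = comb(N_PROTEINS, 2)
print(total_pairs)          # 19653315
print(total_pairs - N_GSP)  # 19641036, the GSN size reported in the text

# Toy example with hypothetical protein names.
print(candidate_negative_pairs(["A", "B", "C"], [("B", "A")]))
```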

Genomic and proteomic feature data sets

A total of six different types of genomic and proteomic data sets obtained from

(1) Gene expression. This data set was obtained from publicly available gene expression data

(2) Essentiality. This data set was derived from ref.

(3) Sequence similarity. This is a quantitative measure of sequence match significance. The data set was generated from the SGD NCBI-BLASTP

(4) Transcription factor. This data set, which takes non-negative discrete values, was obtained from a study published by Harbison et al.

(5) Domain-domain interaction. This data set, which has a value ranging from 0 to 1, was constructed from a study published by Deng et al.

(6) Homology based PPI. This data set was derived from the SGD NCBI PSI-BLAST hits results

Table

Characteristics of six genomic and proteomic data

| **Category** | **Feature abbreviation** | **Feature** | **Number of protein pairs** | **Number of proteins** | **Range** | **Data source** |
|---|---|---|---|---|---|---|
| **Genomic data** | GE | Gene expression | 19,653,315 | 6,270 | [-1, 1] | |
| | ESS | Essentiality | 667,590 | 1,156 | {2, 1, 0} | |
| | SEQ | Sequence similarity | 7,742 | 6,270 | Non-negative real values | |
| | TF | Transcription factor | 19,397,106 | 6,229 | Non-negative discrete, mostly 0 | |
| **Proteomic data** | DD | Domain-domain interaction | 125,435 | 6,359 | [0, 1] | |
| | HO | Homology based PPI | 177,667 | 6,270 | Non-negative discrete, mostly 0 or 1 | |

Protein interaction data sets

To demonstrate the effectiveness of our methodology, we have used four PPI datasets of

Characteristics of four protein interaction data

| **Data set** | **Number of interactions** | **Number of proteins** | **Data source** |
|---|---|---|---|
| **Krogan** | 12934 | 3645 | |
| **DIP** | 17173 | 4875 | |
| **Tong** | 7622 | 2171 | |
| **DIP+BioGRID** | 72794 | 5613 | |
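The difference between the sparse networks and the densely-connected one can be read off from the counts above: the average degree of an undirected network is twice the number of interactions divided by the number of proteins. The short calculation below is only an illustration; the dictionary and function names are hypothetical.

```python
# (number of interactions, number of proteins) as listed for each data set
NETWORKS = {
    "Krogan": (12934, 3645),
    "DIP": (17173, 4875),
    "Tong": (7622, 2171),
    "DIP+BioGRID": (72794, 5613),
}


def average_degree(n_interactions, n_proteins):
    """Each interaction contributes to the degree of two proteins."""
    return 2 * n_interactions / n_proteins


avg = {name: average_degree(e, n) for name, (e, n) in NETWORKS.items()}
for name, value in avg.items():
    print(f"{name}: {value:.2f}")
```

The three sparse networks come out at an average degree of about 7, whereas DIP+BioGRID is close to 26.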

Algorithm

The overview framework for utilizing manifold embedding and multiple information integration to assess and predict protein interactions is illustrated in Figure

**Framework for the combination of manifold embedding and multiple information integration to assess and predict protein interactions based on the integration of diverse sources**.

Logistic regression integration of information sources

Combining evidence from many different sources as features in a supervised learning framework has been proven a successful strategy in reconstructing PPI networks such as yeast

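Now, for a protein pair of interest, the linkage weight prediction problem is posed in terms of their values $x_1, x_2, \ldots, x_n$ in the $n$ information sources.

A general binary regression model takes the form:

$$P(Y = 1 \mid X) = g\left(\beta_0 + \sum_{i=1}^{n} \beta_i x_i\right)$$

where the dependent variable $Y$ is binary ($Y = 1$ for an interacting pair and $Y = 0$ otherwise), $X = (x_1, \ldots, x_n)^{T}$ is the vector of feature values, $\beta = (\beta_0, \beta_1, \ldots, \beta_n)^{T}$ is the parameter vector, and $g$ is the link function.

Then, a logistic regression model is expressed as follows:

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(-\beta_0 - \sum_{i=1}^{n} \beta_i x_i\right)}$$

The parameter vector $\beta$ is estimated from the gold standard data sets.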
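To make the integration step concrete, the following is a minimal sketch of fitting a logistic regression model by gradient ascent on the log-likelihood and using the fitted probability as a linkage weight. The optimizer, learning rate, number of epochs, and the two-feature toy data are assumptions for illustration only; the paper does not specify how its parameters were estimated beyond training on the gold standard pairs.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_weight(beta, x):
    """P(Y = 1 | X) with beta = [b0, b1, ..., bn] and x = [x1, ..., xn]."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return sigmoid(z)


def fit_logistic(features, labels, lr=0.5, epochs=2000):
    """Plain gradient ascent on the average log-likelihood (an assumed optimizer)."""
    n = len(features[0])
    m = len(features)
    beta = [0.0] * (n + 1)
    for _ in range(epochs):
        grad = [0.0] * (n + 1)
        for x, y in zip(features, labels):
            err = y - predict_weight(beta, x)
            grad[0] += err
            for j, xj in enumerate(x, start=1):
                grad[j] += err * xj
        beta = [b + lr * g / m for b, g in zip(beta, grad)]
    return beta


# Hypothetical toy data: two features per protein pair; 1 = GSP, 0 = GSN.
toy_features = [[0.9, 1], [0.8, 1], [0.7, 0], [0.1, 0], [0.2, 0], [0.3, 1]]
toy_labels = [1, 1, 1, 0, 0, 0]
beta = fit_logistic(toy_features, toy_labels)
print(predict_weight(beta, [0.85, 1]))  # weight for a pair resembling the positives
print(predict_weight(beta, [0.15, 0]))  # weight for a pair resembling the negatives
```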

Manifold embedding

Finding a well-fitting null model for weighted PPI networks is a fundamental problem and such a model will provide insights into the interplay between network structure and biological function

Isometric feature mapping

ISOMAP attempts to find a low-dimensional embedding in which the distances between points are approximately equal to the shortest-path distances (on a neighborhood graph in the original input space). The power of ISOMAP can be demonstrated by the three-dimensional "Swiss roll" data set in Figure

**The 3,000 source data points sampled from a Swiss roll surface and its two dimensional embedding**. (a) Swiss roll data set. (b) Superimposed with minimum subset of neighborhoods. (c) Two-dimensional embedding by ISOMAP. (d) Two-dimensional embedding by fast-ISOMAP.

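Given the data points $x_1, x_2, \ldots, x_N$ and the distances $d_{ij}$ between $x_i$ and $x_j$, ISOMAP proceeds in three steps.

Step 1: Construct the neighborhood graph. Points $x_i$ and $x_j$ are connected by an edge of length $d_{ij}$ whenever they are neighbors.

Step 2: Compute the shortest paths. For neighboring points, $d_{ij}$ itself approximates the geodesic distance; for all other pairs, the geodesic distance $d^{G}_{ij}$ is approximated by the length of the shortest path between $x_i$ and $x_j$ in the neighborhood graph.

Step 3: Construct the low-dimensional embedding. Classical multidimensional scaling is applied to the matrix $D^{G} = (d^{G}_{ij})$. Let $\lambda_p$ be the $p$-th largest eigenvalue of the double-centered matrix $-\frac{1}{2} H S H$, where $S_{ij} = (d^{G}_{ij})^{2}$ and $H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^{T}$, and let $v_p$ be the corresponding eigenvector. The $p$-th coordinate of the embedded point $i$ is $\sqrt{\lambda_p}\, v_p^{i}$.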
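The ISOMAP steps can be sketched compactly with NumPy. This is a sketch of the standard algorithm (Floyd-Warshall shortest paths followed by classical multidimensional scaling), not the fast-ISOMAP variant proposed in this paper. It assumes a connected graph whose edge weights are given as a matrix with `inf` for non-neighbors; the function names are hypothetical.

```python
import numpy as np


def geodesic_distances(weights):
    """Floyd-Warshall on an (N, N) matrix of edge lengths.
    Non-neighbors must be np.inf and the diagonal must be 0."""
    dist = np.array(weights, dtype=float)
    n = dist.shape[0]
    for k in range(n):
        # Allow paths that pass through node k.
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    return dist


def classical_mds(dist, dim):
    """Embed points so that Euclidean distances approximate `dist`."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dim]       # indices of the `dim` largest
    top_vals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top_vals)


def isomap(weights, dim):
    """Assumes the graph is connected; otherwise distances stay infinite."""
    return classical_mds(geodesic_distances(weights), dim)


# Toy example: a chain of five nodes with unit-length edges.
n_nodes = 5
chain = np.full((n_nodes, n_nodes), np.inf)
np.fill_diagonal(chain, 0.0)
for i in range(n_nodes - 1):
    chain[i, i + 1] = chain[i + 1, i] = 1.0

coords = isomap(chain, dim=1)
print(np.round(np.abs(coords[:, 0] - coords[0, 0]), 6))
```

For the chain, the one-dimensional embedding places the nodes at unit spacing, so the embedded distances reproduce the shortest-path distances exactly.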

Fast isometric feature mapping

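Although it has proven to be effective on some benchmark artificial and real-world data sets, ISOMAP is limited to data sets with a moderate number of points $N$, because computing the shortest-path distance matrix $D^{G}$ takes $O(N^{3})$ time. The shortest-path computation can be improved to $O(N^{2}\log N)$, but the eigendecomposition still takes $O(N^{3})$ time complexity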

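A prerequisite of the ISOMAP algorithm is to get the neighborhood of each of the data points $x_1, \ldots, x_N$. Let $F = \{S_1, S_2, \ldots, S_N\}$ denote the family of neighborhoods $S_1, \ldots, S_N$, where $S_i$ is the neighborhood of $x_i$. The goal is to find a minimum subset $C \subseteq F$ whose members still cover all data points, i.e., $\bigcup_{S \in C} S = X$ with $X = \{x_1, \ldots, x_N\}$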

This problem is a classical minimum set cover problem which is one of Karp's 21 original NP-complete problems

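**Algorithm 1**. Greedy minimum set cover ($X$, $F$)

1: Initialize $U \leftarrow X$, $C \leftarrow \emptyset$

2: **While** $U \neq \emptyset$ **do**

3: Select an $S \in F$ that maximizes $|S \cap U|$

4: $U \leftarrow U - S$

5: $C \leftarrow C \cup \{S\}$

6: **end while**

7: Return ($C$)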

Algorithm 1 can be efficiently implemented with time complexity $O\left(\sum_{S \in F} |S|\right)$
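A direct, unoptimized sketch of the greedy set cover heuristic is given below. It rescans every candidate set in each iteration, so it does not reach the linear-time bound of a careful implementation; the function name and the toy neighborhoods are hypothetical.

```python
def greedy_set_cover(universe, family):
    """Greedy heuristic for minimum set cover.

    universe: iterable of elements to cover.
    family:   dict mapping a set's name to the set of elements it covers.
    Returns the names of the chosen sets in the order they were picked.
    """
    uncovered = set(universe)
    cover = []
    while uncovered:
        # Pick the set that covers the most still-uncovered elements.
        best = max(family, key=lambda name: len(family[name] & uncovered))
        if not family[best] & uncovered:
            raise ValueError("the family does not cover the universe")
        cover.append(best)
        uncovered -= family[best]
    return cover


# Toy neighborhoods: the entry for point i is the neighborhood of point i.
toy_points = {1, 2, 3, 4, 5, 6}
toy_neighborhoods = {
    1: {1, 2, 3},
    2: {1, 2},
    3: {1, 3, 4},
    4: {3, 4, 5, 6},
    5: {4, 5},
    6: {4, 6},
}
chosen = greedy_set_cover(toy_points, toy_neighborhoods)
print(chosen)
```

In this toy case two neighborhoods suffice to cover all six points, which illustrates how a small subset of neighborhoods can stand in for the full family.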

For PPI networks, where the number of proteins is typically in thousands, the framework of manifold embedding based on fast-ISOMAP can be summarized as

_{1},..., _{N }

Weighted topological metric

Once the nodes of a PPI network have been embedded to a low-dimensional metric space, we can attempt to characterize the topological property by assigning a suitable reliability index (RI), a likelihood indicating the interaction of two proteins, to each protein pair in PPI network based on the similarities between the points in the embedded space. Here we use the weighted Czekanowski-Dice distance index (weighted CD-Dist) to evaluate the reliability of protein interactions.
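For orientation, the sketch below computes the unweighted Czekanowski-Dice distance in its commonly stated form, where the neighbor set of each protein includes the protein itself: the size of the symmetric difference of the two neighbor sets divided by the sum of the sizes of their union and intersection. This is an assumption-laden illustration only. It is not the weighted variant used in this paper, it does not use the embedded coordinates, and the function name and toy network are hypothetical.

```python
def cd_dist(network, u, v):
    """Unweighted Czekanowski-Dice distance between proteins u and v.

    network: dict mapping each protein to the set of its interaction partners.
    Each neighbor set is extended with the protein itself (a common convention).
    Smaller values indicate more shared partners.
    """
    n_u = network.get(u, set()) | {u}
    n_v = network.get(v, set()) | {v}
    sym_diff = n_u ^ n_v
    return len(sym_diff) / (len(n_u | n_v) + len(n_u & n_v))


# Toy network with edges A-B, A-C, B-C, B-D (hypothetical proteins).
toy_network = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}
print(cd_dist(toy_network, "A", "B"))
print(cd_dist(toy_network, "A", "D"))
```

Pairs that share most of their partners, such as A and B here, receive a small distance, while pairs with few shared partners receive a larger one.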

Angelelli et al.

where _{uv }

It is reasonable to use this RI in our study. On one hand, this distance increases the weight of the shared interactors by giving more weight to the similarities than to the differences

Given a _{1},...,_{N}^{d × N}

^{w}

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

YL & ZY conceived the algorithm, carried out analyses, prepared the data sets, carried out experiments, and wrote the paper. ZJ & LZ were responsible for writing software. All the above actions were supervised by DH.

Acknowledgements

This work is supported by the grants of the National Science Foundation of China, Nos. 61102119 & 61133010 & 31071168 & 61005010 & 60905023 & 60975005 & 71001072 & 61001185 & 61171125. The authors would like to thank all the guest editors and anonymous reviewers for their constructive advice.

This article has been published as part of