Department of Computer Science and Engineering, Ohio State University, Columbus, OH, USA

Abstract

Background

Advances in high-throughput technology have led to an increased amount of available protein-protein interaction (PPI) data. Detecting and extracting functional modules that are common across multiple networks is an important step towards understanding the role of functional modules and how they have evolved across species. A global protein-protein interaction network alignment algorithm attempts to find such functional orthologs across multiple networks.

Results

In this article, we propose a scalable global network alignment algorithm based on clustering methods and graph matching techniques in order to detect conserved interactions while simultaneously attempting to maximize the sequence similarity of nodes involved in the alignment. We present an algorithm for multiple alignments, in which several PPI networks are aligned. We empirically evaluated our algorithm on three real biological datasets with 6 different species and found that our approach offers a significant benefit in terms of both quality and speed over the current state-of-the-art algorithms.

Conclusion

Computational experiments on the real datasets demonstrate that our multiple network alignment algorithm is a more efficient and effective algorithm than the state-of-the-art algorithm, IsoRankN. From a qualitative standpoint, our approach also offers a significant advantage over IsoRankN for the multiple network alignment problem.

Background

Advances in technology have enabled scientists to determine, identify and validate pairwise protein interactions through a range of experimental methods such as two-hybrid analysis

A PPI network can be represented as an undirected graph in which each vertex indicates a protein and each edge indicates an interaction between two proteins. The number of interactions is usually linear in the number of proteins in a PPI network. In other words, a protein only interacts with a limited portion of proteins in the same network. The graph is usually unweighted although an edge can often be associated with a confidence value indicating the probability that this edge is a true positive
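The graph representation described above can be sketched in a few lines of Python. This is only an illustration: the class name `PPINetwork`, its methods, and the default confidence of 1.0 for unweighted edges are assumptions made for the example, not part of the original method.

```python
from collections import defaultdict


class PPINetwork:
    """Undirected PPI graph; each edge may carry an optional confidence value.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        # protein -> {neighbor protein: confidence of the interaction}
        self.adj = defaultdict(dict)

    def add_interaction(self, p, q, confidence=1.0):
        # The graph is undirected, so the edge is stored in both directions.
        self.adj[p][q] = confidence
        self.adj[q][p] = confidence

    def neighbors(self, p):
        # .get avoids creating an empty entry for an unknown protein.
        return set(self.adj.get(p, {}))

    def num_proteins(self):
        return len(self.adj)

    def num_interactions(self):
        # Every edge is stored twice (no self-loops are assumed).
        return sum(len(nbrs) for nbrs in self.adj.values()) // 2


net = PPINetwork()
net.add_interaction("P1", "P2", confidence=0.9)
net.add_interaction("P2", "P3")
print(net.num_proteins(), net.num_interactions(), sorted(net.neighbors("P2")))
```

Because PPI networks are sparse, an adjacency structure of this kind uses memory roughly proportional to the number of interactions rather than to the square of the number of proteins.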

A local network alignment (LNA) algorithm aims to find highly similar pairs of motifs, i.e., subnetworks, across networks. The main drawback of LNA is that it might map one motif to several similar motifs

GNA requires that each protein in the network should be either matched to some proteins in other networks or marked as unaligned protein by the alignment, where the matches should be consistent

The main goals of GNA are to conserve the network topology and to ensure that the matched proteins' sequences are as similar as possible

We present a simple but scalable approach for global multiple network alignment to exploit the sparsity of the PPI networks. Since fully integrating the sequence similarity and network topology is time-consuming especially in multiple networks, we consider these two goals independently. Our approach relies on first preprocessing the similarity scores and clustering all proteins into groups based on their similarity. We then adopt a seed-expansion heuristic strategy

We present a detailed empirical study which illustrates the benefits of the proposed approach on three real datasets. In short, we find that the proposed approach significantly improves over the state-of-the-art IsoRankN algorithm

Related work

Several algorithms, including MaWISH

PATH

Græmlin 2.0

Methods

Definition of multiple network alignment

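Assume we have $k$ PPI networks $G_1, G_2, \ldots, G_k$, where $G_i = (V_i, E_i)$. An edge $(v_x, v_y) \in E_i$ indicates an interaction between proteins $v_x$ and $v_y$.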

Metrics for alignments

In order to identify functional orthologs across multiple networks, the goal of PPI network alignment algorithms generally is to find corresponding matches across all networks as these match-sets should contain similar proteins and conserve as many interactions as possible

The sequence similarity score for two proteins _{x }_{y }_{x}_{y }_{x}_{y}_{x }_{y }_{x }_{y }_{x}_{y}

The average similarity score of an alignment

where _{i}_{i}

Since the similarity score between two proteins from the same network is zero, a match-set that includes proteins across several networks instead of only two networks will be preferred.

Let _{i}_{i}_{x}_{y}_{i }_{x }_{x' }_{y }_{y'}, i.e., _{x}_{x'}) and _{y}_{y'}), or (2) _{x }_{y }_{x}_{y}_{x}_{x'}) > 0 and _{y}_{y'}) > 0.

Algorithm overview

As we mentioned before, optimization of the objective function consisting of these two goals for the pairwise network alignment problem is NP-hard. For this reason, we propose a heuristic method which independently considers these two goals in sequence in order to find a multiple alignment in feasible time. The main idea here is that there exist several

The overall procedure of our algorithm is illustrated in Figure

**The procedure of our method.**

Notations in the algorithms

Whether the protein

{**|**

The network where protein

The match-set containing protein

The networks where at least one protein in

Preprocessing

The similarity matrix only represents the sequence similarity. If we only identify the seeds based on this similarity matrix, the seeds would not reflect any network topology. Therefore, we adopt a simple preprocessing method to integrate a part of network topology into the similarity matrix. Note that if we consider the whole network topology, the preprocessing would be too time-consuming (see

The main idea is that a pair of proteins with similar neighbors should be aligned together rather than a pair of proteins with a close similarity score but without any similar neighbors. Therefore we add an extra score which measures the similarity of two proteins' neighbors to the original similarity score. This extra score of a pair of proteins _{a }_{b}_{a}_{b}_{a}_{b}_{a }_{b}

An example is shown in Figure

**An example with two networks, A and B.** The two tables are the similarity scores with and without preprocessing. The solid lines connecting two proteins in the same network are edges, and bold lines are conserved edges. The arrows across two networks are match-sets. The threshold

**Algorithm 1** Seed Generation

**Input:** A set of clusters

**Output:** A set of seeds Ê.

1: Ê ← Ø;

2: **for all ****do**

3: **for all **_{x}_{y }_{x}_{y}**do**

4: **if **_{x}_{y}**then**

5: Ê ← Ê ∪(_{x}_{y}

Seed generation via clustering

Seeds, which are pairs of proteins with high similarity, can be identified easily by pruning out all pairs of proteins with a similarity score lower than a threshold. However, this approach is not optimal, since one protein might be similar to several proteins that are not mutually similar, and the seeds generated this way might therefore degrade the quality of the match-sets formed by subsequent stages. Because the seeds should be generated by globally considering their mutual similarity, we observe that seed generation is equivalent to a clustering problem, in which mutually similar proteins should be clustered together.
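The difference between plain thresholding and cluster-restricted seed generation can be illustrated with a small Python sketch. The function names, the dictionary-based similarity table, and the toy scores are assumptions made for this example; the clusters are supplied by hand here rather than computed by a clustering tool.

```python
from itertools import combinations


def pair_key(x, y):
    """Order-independent key for a pair of proteins."""
    return tuple(sorted((x, y)))


def threshold_seeds(sim, threshold):
    """Naive seeds: every pair whose similarity reaches the threshold."""
    return {pair for pair, score in sim.items() if score >= threshold}


def cluster_seeds(clusters, sim, threshold):
    """Seeds restricted to pairs inside the same cluster.

    Pairs missing from `sim` are treated as having similarity zero, so a
    positive threshold also excludes proteins from the same network.
    """
    seeds = set()
    for cluster in clusters:
        for x, y in combinations(cluster, 2):
            if sim.get(pair_key(x, y), 0.0) >= threshold:
                seeds.add(pair_key(x, y))
    return seeds


# a1 is similar to both b1 and c1, but b1 and c1 are not similar to each other.
sim = {pair_key("a1", "b1"): 0.9, pair_key("a1", "c1"): 0.8}
clusters = [{"a1", "b1"}, {"c1"}]  # hypothetical clustering result

print(sorted(threshold_seeds(sim, 0.5)))
print(sorted(cluster_seeds(clusters, sim, 0.5)))
```

In this toy case, thresholding alone keeps both pairs involving `a1`, whereas the cluster-restricted version keeps only the pair whose members were grouped together.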

We therefore adopt a clustering algorithm to identify the groups of similar proteins and then use Algorithm 1 to identify seeds. Algorithm 1 examines all possible pairs of proteins in each cluster (line 3), and then it uses the threshold parameter

The clustering results directly affect the seeds generated by Algorithm 1, so it is very important for our algorithm to choose a clustering method that yields higher similarity scores. Since the average similarity score of a match-set is determined by the similarity scores of all pairs of proteins within the match-set, the clusters used to generate seeds should have a nearly globular shape, i.e., any pair of proteins within a cluster should be reasonably similar. Density clustering methods, such as DBSCAN

**Algorithm 2** Seed expansion on multiple networks

**Input:** A set of seeds Ê.

**Output:** A set of match-sets Ŝ

1: **for all ****do**

2: _{x}_{x}

3: **for all **{_{x}_{y}**do**

4: Push ((_{x}_{y}_{x}_{y}

5: **while ****do**

6: (_{x}_{y}

7: Merge(_{x}_{y}

8: **for all **match-set **|****do**

9: **for all **(_{i}_{j}_{i}_{j}**do**

10: **for all **(_{x}_{y}_{x }_{i}_{y }_{j}**do**

11: Push ((_{x}_{y}_{x}_{y}

12: **while ****do**

13: (_{i}_{j}

14: **if **Merge(_{i}_{j}**then**

15: **for all **(_{x}_{y}_{x }_{i}_{y }_{j}**do**

16: Push ((_{x}_{y}_{x}_{y}

17: Ŝ ← {match-set

Seed-expansion strategy

We observe that if a new match-set consisting of the neighbors of an existing match-set is formed, two edges connecting the existing match-set to the new match-set are conserved. Therefore, we start conserving edges by first expanding the seeds, i.e., forming new match-sets consisting of the neighbors of seeds. Since we still want to obtain high similarity scores during expansion, we maintain a priority queue that contains all expandable pairs of proteins and iteratively select the expandable pair with the highest similarity score. Once a pair is popped from the priority queue and used to form a new match-set, we put the neighbors of the new match-set into the priority queue in order to expand the alignment further. Thus, each time we pick a pair of proteins and align them, we conserve two edges and expand the alignment. This strategy directly conserves a large number of edges while still obtaining high similarity scores.
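The following Python sketch illustrates the seed-expansion idea for the simplified case of two networks, where every match-set is a single pair. It is not the multiple-network procedure of Algorithm 2: the function name `expand_seeds`, the dictionary-based inputs, the requirement that an expandable pair has a non-zero similarity score, and the assumption that the seeds are disjoint are all choices made for this example.

```python
import heapq


def expand_seeds(seeds, adj_a, adj_b, sim):
    """Greedy seed expansion between two networks A and B (illustrative).

    seeds: list of disjoint (a, b) pairs already aligned.
    adj_a, adj_b: dict mapping a protein to the set of its neighbors.
    sim: dict mapping (a, b) to a similarity score.
    """
    aligned_a, aligned_b = {}, {}
    heap = []  # max-heap emulated by pushing negated scores

    def push_neighbor_pairs(a, b):
        # Neighbors of an aligned pair are the expandable candidates.
        for na in adj_a.get(a, ()):
            for nb in adj_b.get(b, ()):
                score = sim.get((na, nb), 0.0)
                if score > 0 and na not in aligned_a and nb not in aligned_b:
                    heapq.heappush(heap, (-score, na, nb))

    for a, b in seeds:
        aligned_a[a], aligned_b[b] = b, a
    for a, b in seeds:
        push_neighbor_pairs(a, b)

    while heap:
        _, a, b = heapq.heappop(heap)
        if a in aligned_a or b in aligned_b:
            continue  # one of the proteins was aligned in the meantime
        aligned_a[a], aligned_b[b] = b, a
        push_neighbor_pairs(a, b)
    return aligned_a


adj_a = {"a1": {"a2"}, "a2": {"a1", "a3"}, "a3": {"a2"}}
adj_b = {"b1": {"b2"}, "b2": {"b1", "b3"}, "b3": {"b2"}}
sim = {("a1", "b1"): 0.9, ("a2", "b2"): 0.7, ("a3", "b3"): 0.5}
alignment = expand_seeds([("a1", "b1")], adj_a, adj_b, sim)
print(alignment)
```

Each aligned pair reached through expansion is adjacent to an already aligned pair in both networks, which is why every expansion step conserves an edge on each side.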

**Algorithm 3** Merge

**Input: **A pair of proteins (_{i}_{j}

**Output: **A boolean value indicating whether the algorithm merges _{i}_{j}

1: **if **|_{i}_{j}_{* }|_{i}_{j }**then**

2: **return false**;

3: _{x}_{y}_{x}_{y }_{i}_{j }

4: _{x}_{y}_{x }_{i}_{y }_{j}_{x}_{y}

5: _{i}_{j}_{i}_{j}

6: **if ****and **_{* }_{i}_{j}**then**

7: Merge _{i}_{j}

8: _{i}_{j}**← true**;

9: **return true**;

10: **else**

11: **return false**;

Figure

Merging match-sets

In multiple alignments, each match-set usually consists of more than two proteins, some of which might be in the same network, so some of the seeds generated by Algorithm 1 and the match-sets formed by the seed-expansion strategy should be merged if the proteins in these match-sets are mutually similar enough. In Algorithm 2, we introduce a procedure similar to agglomerative clustering in order to merge match-sets. In agglomerative clustering, each individual object initially forms a cluster containing only itself, and two clusters are then merged into one cluster in each round. Here, each protein initially forms a match-set containing only itself (lines 1-2), and we iteratively merge two match-sets into one according to the merging criterion. First, we merge the proteins contained in a seed to form a larger seed, which might span more than two networks (lines 3-7). Then, we apply the seed-expansion strategy, i.e., we expand the seeds by aligning the neighbors of the seeds (lines 8-16). Once a new pair of proteins is selected by the priority queue, we use the merging criterion to examine whether the two match-sets containing these two proteins should be merged (line 14). If the merging criterion merges these two match-sets, we put all pairs of their unaligned neighbors into the priority queue in order to conserve more edges.
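The agglomerative merging of match-sets can be sketched as follows. Only the size check of the merging criterion is modeled (a merged match-set may not exceed the number of networks it spans multiplied by a parameter); any similarity-based part of the criterion is omitted. The names `try_merge`, `set_of`, `network_of`, and `size_factor` are hypothetical and chosen for this example.

```python
def try_merge(set_of, network_of, x, y, size_factor):
    """Merge the match-sets containing x and y if the size check passes.

    set_of: dict mapping each protein to its match-set (a shared set object).
    network_of: dict mapping each protein to the network it belongs to.
    size_factor: illustrative stand-in for the size-threshold parameter.
    Returns True if the two match-sets were merged, False otherwise.
    """
    sx, sy = set_of[x], set_of[y]
    if sx is sy:
        return False  # already in the same match-set
    merged = sx | sy
    networks = {network_of[p] for p in merged}
    if len(merged) > len(networks) * size_factor:
        return False  # merged match-set would be too large
    for p in merged:
        set_of[p] = merged  # all members share the new match-set
    return True


network_of = {"a1": "A", "a2": "A", "b1": "B"}
# As in agglomerative clustering, every protein starts in its own match-set.
set_of = {p: {p} for p in network_of}

first = try_merge(set_of, network_of, "a1", "b1", size_factor=1)
second = try_merge(set_of, network_of, "a2", "b1", size_factor=1)
print(first, second, sorted(set_of["b1"]))
```

With `size_factor=1`, a match-set spanning two networks can hold at most two proteins, so the second merge is rejected; a larger factor would allow several proteins from the same network to share a match-set.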

The merging criterion, Algorithm 3, determines whether two match-sets should be merged. If yes, the algorithm merges these two match-sets and then returns true. If no, it simply returns false. The first criterion is that the merged match-set cannot be larger than a size threshold, which is the number of networks this match-set crosses multiplied by the parameter

Aligning remaining proteins

Since there are no expandable pairs of proteins left after stage 3 (Algorithm 2), we mainly focus on similarity scores in stage 4. The alignable pairs in this stage are all pairs of proteins for which (1) neither protein has been aligned yet, and (2) the similarity score of the pair is not zero. The second condition prunes a large number of the pairs that satisfy the first condition. We sort these alignable pairs and then iteratively pick the pair with the highest similarity score, again applying the merging criterion (Algorithm 3) to each picked pair so that match-sets can span more than two networks. Note that since the PPI networks are usually well-connected, the number of remaining proteins for stage 4 is typically a very small portion of all proteins.
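A minimal Python sketch of this final stage is given below. It assumes that the merging criterion is available as a callable that returns True when a merge happens; the stand-in `merge_if_small`, the maximum match-set size of 3, and all names and scores are hypothetical and serve only to make the example runnable.

```python
def align_remaining(unaligned, sim, try_merge_fn):
    """Greedily align leftover proteins by decreasing similarity.

    unaligned: set of proteins not aligned by the earlier stages.
    sim: dict mapping (x, y) to a similarity score.
    try_merge_fn: callable(x, y) -> bool, stand-in for the merging criterion.
    """
    # Alignable pairs: both proteins unaligned and a non-zero similarity.
    candidates = [
        (score, x, y)
        for (x, y), score in sim.items()
        if score > 0 and x in unaligned and y in unaligned
    ]
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [(x, y) for _, x, y in candidates if try_merge_fn(x, y)]


# Hypothetical merging rule: match-sets may hold at most three proteins.
groups = {p: {p} for p in ["a1", "b1", "b2", "c1"]}


def merge_if_small(x, y, max_size=3):
    gx, gy = groups[x], groups[y]
    if gx is gy or len(gx) + len(gy) > max_size:
        return False
    merged = gx | gy
    for p in merged:
        groups[p] = merged
    return True


sim = {("a1", "b1"): 0.9, ("b1", "c1"): 0.6, ("a1", "b2"): 0.4, ("a1", "c1"): 0.0}
accepted = align_remaining(set(groups), sim, merge_if_small)
print(accepted, sorted(groups["a1"]))
```

Sorting once and scanning the list is sufficient here because, unlike in the expansion stage, no new candidate pairs are generated while the list is processed.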

Complexity analysis

Let _{max }

In stage 1, we adopt the I1 and I2 criterion functions in CLUTO, whose time complexity to cluster all proteins is ^{2 }log ^{2}). However, if the clusters are balanced, i.e., each cluster consists of average ^{2}) time complexity.

Let _{max }_{max }_{max }_{max }^{4}^{2 }log

Hence, the total time complexity of our algorithm is ^{2 }log

Results and discussion

Experiment setup

In this section, we present experimental studies on real datasets. We performed our experiment on three real databases, DOMAIN

Table

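Experimental datasets

| Datasets | Species | # proteins | # PPIs | Percent of proteins with GO terms |
|---|---|---|---|---|
| DOMAIN | D. mela | 5014 | 10884 | 95.8% |
| DOMAIN | S. cere | 3481 | 11186 | 87.2% |
| DOMAIN | C. eleg | 1864 | 2159 | 90.0% |
| DIP | D. mela | 7486 | 22340 | 82.89% |
| DIP | S. cere | 5139 | 24821 | 93.87% |
| DIP | H. sapi | 5025 | 12705 | 95.22% |
| DIP | C. eleg | 3095 | 4891 | 68.27% |
| DIP | E. coli | 2953 | 11759 | 65.09% |
| DIP | M. musc | 1149 | 1171 | 97.39% |
| DIP | H. pylo | 708 | 1354 | 68.05% |
| BioGRID | D. mela | 7210 | 24710 | 86.1% |
| BioGRID | C. eleg | 3420 | 6339 | 87.0% |
| BioGRID | S. pomb | 1995 | 12573 | 99.7% |
| BioGRID | H. sapi | 8282 | 45031 | 93.20% |
| BioGRID | A. thal | 1609 | 2861 | 94.59% |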

The raw BLAST bit scores favor long sequences while the normalized BLAST bit scores, whose values are between 0 and 1, are independent of the sequence length.

The DOMAIN dataset is a subset of an old-version dataset from DIP (version 10/14/2008). It requires that the sequence of each protein in the DOMAIN dataset should contain at least one domain, where a domain is defined as a FASTA sequence pattern. We then use the information of domains to calculate the similarity score: the similarity score in the DOMAIN dataset is

where _{x }_{y }

The experiments were performed on a dual-core machine (Intel Core i5 650, 3.2 GHz) with 16 GB of main memory. We discuss the tradeoff when tuning

In addition to the number of conserved edges and average similarity score, we used p-value

and we select the lowest p-value among all GO terms as the p-value of this match-set. Following standard practice, a GO term is considered enriched if the p-value of at least one match-set with respect to this GO term is less than 10^{-4}.
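The selection rules above can be expressed in a short Python sketch. It assumes that the per-term p-values of every match-set have already been computed and takes them as input; the function names and the placeholder term labels (`term_A`, `term_B`, `term_C`) are hypothetical.

```python
def match_set_p_value(go_p_values):
    """The p-value of a match-set is its lowest p-value over all GO terms."""
    return min(go_p_values.values())


def enriched_go_terms(match_sets, cutoff=1e-4):
    """GO terms whose p-value falls below the cutoff in at least one match-set.

    match_sets: list of dicts, one per match-set, mapping a GO term to the
    p-value of that match-set with respect to the term.
    """
    enriched = set()
    for go_p_values in match_sets:
        for term, p in go_p_values.items():
            if p < cutoff:
                enriched.add(term)
    return enriched


# Placeholder p-values for two match-sets (not taken from the experiments).
match_sets = [
    {"term_A": 3e-6, "term_B": 0.02},
    {"term_B": 5e-5, "term_C": 0.3},
]
print([match_set_p_value(ms) for ms in match_sets])
print(sorted(enriched_go_terms(match_sets)))
```

Note that a term such as `term_B` counts as enriched even though it is not significant in the first match-set, because one match-set below the cutoff is sufficient.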

For multiple alignments, most existing algorithms do not align all proteins; in other words, some proteins would not belong to any match-set.

Comparison between different clustering methods

Figure

**The trade-offs in the DIP dataset.** The average similarity score and the conserved edge rate of different clustering methods (agglo and rbr) and different criterion functions (i1 and i2).

Comparison with IsoRankN

When evaluating the performance, we compare our method with preprocessing against IsoRankN

Comparison of quality between our method and IsoRankN

| Datasets | DOMAIN: Ours | DOMAIN: IsoRankN | DIP: Ours | DIP: IsoRankN | BioGRID: Ours | BioGRID: IsoRankN |
|---|---|---|---|---|---|---|
| Coverage | 8588 | 3372 | 24119 | 19555 | 21385 | 13928 |
| Average similarity score | 237.06 | 174.32 | 0.00509 | 0.00426 | 0.1735 | 0.0834 |
| Conserved edges rate | .260 (6111 of 23507) | .160 (615 of 3138) | .2209 (17365 of 78611) | .086 (4696 of 54364) | 0.1781 (13003 of 72975) | 0.0685 (2586 of 37434) |
| # total enriched GO terms | 1026 | 90 | 3871 | 1893 | 1123 | 523 |

The average similarity scores for the DOMAIN dataset are computed by equation 2, and the average similarity scores for the DIP and BioGRID datasets are normalized BLAST bit scores, computed by equation 1.

We also observe that preprocessing increases the number of strictly conserved edges. Without preprocessing, the numbers of strictly conserved edges on the DOMAIN, DIP, and BioGRID datasets are 787, 1868, and 1808, respectively; with preprocessing, they are 1125, 2243, and 2216, respectively.

Figure

**The lowest 500 p-values on each dataset.** (a) DOMAIN (b) DIP (c) BioGRID.

Figure

**The best match-set (with the lowest p-value) discovered by our algorithm.** Proteins with the same circle line style are in the same match-set formed by IsoRankN. Proteins without any circle are not covered by IsoRankN. The same color represents the same species. (a) DIP (DIP ID) (b) BioGRID (UniprotKB AC).

Our method significantly outperforms IsoRankN in terms of execution time, since IsoRankN iteratively updates a huge matrix whose size is proportional to the square of the total number of proteins in all networks. The dominant cost in our method is the time to preprocess and run the clustering algorithm, and these two steps only have to be executed once to generate alignments with different trade-offs. IsoRankN, in contrast, has to be rerun from scratch if the trade-off weight is changed. For the DOMAIN, DIP, and BioGRID datasets, the alignment stage of our method takes only 5 to 40 seconds, while the preprocessing and clustering stages together take 37, 234, and 175 seconds, respectively; IsoRankN takes 4621, 41719, and 35224 seconds on average, respectively. Because IsoRankN needs to rerun the whole process every time the threshold parameters are changed, as shown in Figure

**Execution time for generating several alignments.**
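The effect of paying the preprocessing and clustering cost only once can be made concrete with a small calculation based on the timings reported above. The sketch makes two simplifying assumptions: every alignment run of our method is charged the upper bound of 40 seconds, and every IsoRankN run is charged its reported average time. The function and constant names are chosen for this example.

```python
# Reported timings in seconds.
PREPROCESS_AND_CLUSTER = {"DOMAIN": 37, "DIP": 234, "BioGRID": 175}
ISORANKN_PER_RUN = {"DOMAIN": 4621, "DIP": 41719, "BioGRID": 35224}
ALIGNMENT_UPPER_BOUND = 40  # the alignment stage takes 5 to 40 seconds


def total_time_ours(dataset, num_alignments):
    """One-off preprocessing and clustering plus one alignment stage per setting."""
    return PREPROCESS_AND_CLUSTER[dataset] + num_alignments * ALIGNMENT_UPPER_BOUND


def total_time_isorankn(dataset, num_alignments):
    """IsoRankN is rerun from scratch for every parameter setting."""
    return num_alignments * ISORANKN_PER_RUN[dataset]


for name in PREPROCESS_AND_CLUSTER:
    print(name, total_time_ours(name, 5), total_time_isorankn(name, 5))
```

Under these assumptions, generating five alignments on the DOMAIN dataset costs at most 237 seconds with our method, compared with 23105 seconds for five IsoRankN runs.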

To summarize, we find that

Conclusions

In this article, we present an efficient global PPI network alignment algorithm. Although our approach can also be applied to pairwise alignments, it mainly addresses multiple alignments, which are more comprehensive. In order to efficiently identify functional orthologs across multiple networks, we propose the merging criterion and apply the seed-expansion strategy and clustering techniques to conserve interactions and find similar protein sequences. Results on a number of real datasets highlight the effectiveness, efficiency, and scalability of our algorithm compared with the state-of-the-art multiple alignment algorithm, IsoRankN. From a qualitative standpoint, our approach also offers a significant advantage over IsoRankN for the multiple alignment problem.

In this work, we do not explicitly consider the case of weighted PPI networks. In many cases, weights representing the confidence associated with a detected interaction may be available. As part of future work, we would like to investigate the GNA problem with weighted interactions, where the weights represent the evidence that an interaction exists. A promising direction here is to develop a mechanism to integrate the similarity score with the interaction confidence. Another possible strategy is to adopt state-of-the-art graph clustering algorithms, such as

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Y-KS developed software, conducted experiments, and drafted the manuscript. SP suggested the initial idea which was subsequently refined by both authors, guided the development and analysis of the method, helped design the experiments and co-wrote the manuscript. Both authors read and approved the final manuscript.

Acknowledgements

We thank Tyler Clemons, Venu Satuluri, Mikhail Zaslavskiy, Chung-Shou Liao, and Xin Guo for providing experimental data and helpful suggestions for improving this work. This work is supported in part by the following NSF grants: CCF-0702587, IIS-0917070 and IIS-1141828. The work described herein reflects the opinions of the authors and does not necessarily reflect the opinions of the NSF.

This article has been published as part of