Abstract
Background
Advances in highthroughput technology has led to an increased amount of available data on proteinprotein interaction (PPI) data. Detecting and extracting functional modules that are common across multiple networks is an important step towards understanding the role of functional modules and how they have evolved across species. A global proteinprotein interaction network alignment algorithm attempts to find such functional orthologs across multiple networks.
Results
In this article, we propose a scalable global network alignment algorithm based on clustering methods and graph matching techniques in order to detect conserved interactions while simultaneously attempting to maximize the sequence similarity of nodes involved in the alignment. We present an algorithm for multiple alignments, in which several PPI networks are aligned. We empirically evaluated our algorithm on three real biological datasets with 6 different species and found that our approach offers a significant benefit both in terms of quality as well as speed over the current stateoftheart algorithms.
Conclusion
Computational experiments on the real datasets demonstrate that our multiple network alignment algorithm is a more efficient and effective algorithm than the stateoftheart algorithm, IsoRankN. From a qualitative standpoint, our approach also offers a significant advantage over IsoRankN for the multiple network alignment problem.
Background
Advances in technology have enabled scientists to determine, identify and validate pairwise protein interactions through a range of experimental methods such as twohybrid analysis [1] and coimmunoprecipitation [2]. An important task, particularly when investigating networks of multiple species within a phylogeny tree, is that of network alignment where the objective is to understand which subnetworks are conserved across species in order to better understand the role of functional modules across the evolution of species [35]. Informally, given several (related) PPI networks and the protein sequence similarity scores between proteins within said networks, the goal of a network alignment algorithm is to find the best alignment, i.e., a mapping, which best represents functional orthologs among proteins within these networks. Additionally, this problem is also equivalent to identifying most biologically consistent matchsets, which are groups of proteins representing functional orthologs.
A PPI network can be represented as an undirected graph in which each vertex indicates a protein and each edge indicates an interaction between two proteins. The number of interactions is usually linear in the number of proteins in a PPI network. In other words, a protein only interacts with a limited portion of proteins in the same network. The graph is usually unweighted although an edge can often be associated with a confidence value indicating the probability that this edge is a true positive [6,7].
A local network alignment (LNA) algorithm aims to find highly similar pairs of motifs, i.e., subnetworks, across networks. The main drawback of LNA is that it might map one motif to several similar motifs [810]. Although this may sometimes reflect the gene duplication or gene fusion, LNA usually provides an unreasonable number of matches for a single protein. Moreover, it often suffers from a lack of a comprehensive picture of the network  a LNA usually aligns a very small portion of the entire network as most of the proteins do not appear in the alignment. Finally, a local alignment need not be consistent, i.e., a protein might be matched to more than two proteins where these matched proteins are not completely matched to each other. For this reason, global network alignment (GNA), which maps most of proteins across networks in a consistent and comprehensive way, has been proposed [8,1113].
GNA requires that each protein in the network should be either matched to some proteins in other networks or marked as unaligned protein by the alignment, where the matches should be consistent [9]. Although a GNA can be estimated by using the result of a LNA as possible matches, the inconsistent matches of LNA typically limit the quality of the result alignment.
The main goals of GNA are to conserve the network topology and to ensure that the matched proteins' sequences are as similar as possible [7,8]. Since it involves a tradeoff between these two competing goals, a GNA algorithm often employs a selftuning mechanism to control this tradeoff. A GNA algorithm may be a pairwise approach (where two networks are aligned) or a multiway approach (where multiple networks are simultaneously aligned). A pairwise alignment is a special case of a multiple alignment. Multiple alignment strategies are often more biologically comprehensive since they can represent the degree of gene duplication as well as gene fusion. However, existing algorithms are not scalable enough to efficiently process a number of large networks with different tradeoffs. Hence, we propose a scalable multiple global network alignment algorithm in this paper.
We present a simple but scalable approach for global multiple network alignment to exploit the sparsity of the PPI networks. Since fully integrating the sequence similarity and network topology is timeconsuming especially in multiple networks, we consider these two goals independently. Our approach relies on first preprocessing the similarity scores and clustering all proteins into groups based on their similarity. We then adopt a seedexpansion heuristic strategy [14] in order to exploit the sparsity of the network, where a seed matchset is a set of proteins with high similarity scores. The idea here is analogous to region growth strategies that have found favor within the image processing community [15]. Subsequently, we develop a simple merging criterion to enable the multiple alignment of PPI networks.
We present a detailed empirical study which illustrates the benefits of the proposed approach on three real datasets. In short, we find that the proposed approach significantly improves over the stateoftheart IsoRankN algorithm [10] along the twin axes of quality and efficiency. Specifically, from a qualitative perspective we found that our method can find more functionally consistent and more comprehensive alignments across all datasets. From an efficiency standpoint, our method is 50x to 500x faster than IsoRankN when executing on these three real datasets.
Related work
Several algorithms, including MaWISH [16], Græmlin [17], DOMAIN [18], PathBLAST [19,20], Network BLAST [3], and Network BLASTM [14], have focused on LNA. In order to efficiently obtain the maximal similarity score as conserving the maximal number of edges, Network BLAST and Network BLASTM exploit the sparsity of the PPI networks by adopting seedexpanding method. Since LNA has several drawbacks as mentioned above, recent research has focused on GNA.
PATH [11], GA [8], NATALIE [12], NetAlignBP, NetAlignMR [13], and PISwap [5] all focus on GNA and all of them only address the pairwise alignment problem. They formulate the pairwise alignment problem into an objective function which combines the two goals of GNA: maximize the average similarity score and the number of conserved edges. Li et al. [7] shows that optimization of this objective function is equivalent to a binary integer quadratic programming problem, in which each binary variable represents whether a pair of proteins is matched together. Since the integer quadratic programming problem is NPhard, all of these pairwise alignment algorithms are approximate algorithms. GA and PATH are based on graph matching algorithms that iteratively update a permutation matrix representing the matches between vertices of two graphs. However, this iterative process is timeconsuming especially for networks consisting of thousands of proteins. NATALIE derives integer linear programming formulations from the integer quadratic programming problem and applies several relaxation algorithms to solve the linear programming problem. Bayati et al. [13] propose NetAlignBP and NetAlignMP for the constrained network alignment problem, in which a set of legal matches is given. NetAlignBP and NetAlignMP are efficient for large networks, but if the legal matches are all pairs of proteins across two networks, as in the general (unconstrained) network alignment problem, they both suffer from a large space demand. PISwap first utilizes the Hungarian algorithm [21], which is used to find the global matches which maximize the similarity score, and then use a method similar to 20pt [22], which is a heuristic algorithm for the traveling salesman problem, to swap the matches generated by the Hungarian algorithm in order to conserve edges. All of these pairwise alignment algorithms can find a nearoptimal solution, but they cannot be adopted to the multiple alignment problem in a scalable fashion.
Græmlin 2.0 [23], IsoRank [24] and IsoRankN [10] generate multiple global alignments. Græmlin 2.0 automatically learns the scoring function's parameters, which indicate the weight of each feature of an alignment, from a training data set and then locally optimizes the scoring function in order to generate a nearoptimal alignment. Græmlin 2.0 therefore needs a set of known alignments and thus the quality of Græmlin is sensitive to the training data set. IsoRank and IsoRankN generate a score matrix to capture both sequence similarity and network topology for all pairs of proteins, and IsoRankN uses a spectral clustering method to improve IsoRank. However, IsoRank and IsoRankN both need to iteratively generate and update the huge score matrix and hence both methods are inefficient for multiple networks consisting more than ten thousand proteins.
Methods
Definition of multiple network alignment
Assume we have k PPI networks {G_{1}, G_{2},..., G_{k}}. Each PPI network is an unweighted undirected graph G_{i }= (V_{i}, E_{i}), where is a set of proteins and each edge (v_{x}, v_{y}) ∈ E_{i }represents an interaction between two proteins v_{x }and v_{y}. Let and be the total number of proteins. A network alignment Ŝ is a set of mutually disjoint matchsets . Each matchset is a subset of proteins, i.e., for and iff x ≠ y. Note that each matchset can consist of any number of proteins from each network and some proteins would not belong to any matchset. The main concept is that proteins in the same matchset are matched together. In other words, an alignment is similar to a partition of all proteins but some proteins might not belong to any matchset.
Metrics for alignments
In order to identify functional orthologs across multiple networks, the goal of PPI network alignment algorithms generally is to find corresponding matches across all networks as these matchsets should contain similar proteins and conserve as many interactions as possible [7,8]. Therefore, we adopt the average similarity score and the number of conserved edges introduced in [7] as the metrics for a network alignment.
The sequence similarity score for two proteins v_{x }and v_{y }is denoted by sim(v_{x}, v_{y }). sim(v_{x}, v_{y}) is set to 0 if v_{x }and v_{y }are in the same network since we only care about the similarity between proteins from different networks in the alignment problem. If v_{x }and v_{y }are in different networks, sim(v_{x}, v_{y}) is computed by BLAST or Pfam (Pfam 25.0) [25]. A higher similarity score indicates that the two proteins' sequences are more similar. Furthermore, both BLAST and Pfam scores for most of pairs of proteins are 0. The details about BLAST and Pfam are described in the result section.
The average similarity score of an alignment is defined as the weighted average similarity score of the corresponding matchsets:
where sim(S_{i}) is the average similarity score of the matchset S_{i}:
Since the similarity score between two proteins from the same network is zero, a matchset that includes proteins across several networks instead of only two networks will be preferred.
Let S(v_{i}) denote the matchset containing v_{i}. An edge (v_{x}, v_{y}) ∈ E_{i }is conserved by an alignment Ŝ if (1) there exists an edge , such that v_{x }and v_{x' }are aligned together as well as v_{y }and v_{y'}, i.e., S(v_{x}) = S(v_{x'}) and S(v_{y}) = S(v_{y'}), or (2) v_{x }and v_{y }are aligned together, i.e., S(v_{x}) = S(v_{y}). Thereby, we count the number of conserved edges for an alignment by examining whether each edge satisfies these conditions. Note that an alignment which has only one single matchset containing all proteins would conserve all edges, but, the similarity score of this alignment would simply be the average similarity score of all pairs of proteins and therefore is not a meaningful alignment. We further define the term strictly conserved edge as an edge that satisfies the first condition and sim(v_{x}, v_{x'}) > 0 and sim(v_{y}, v_{y'}) > 0.
Algorithm overview
As we mentioned before, optimization of the objective function consisting of these two goals for the pairwise network alignment problem is NPhard. For this reason, we propose a heuristic method which independently considers these two goals in sequence in order to find a multiple alignment in feasible time. The main idea here is that there exist several seed matchsets (or seeds for simplicity), in which the proteins are highly similar to each other. Therefore, our approach is to initially ignore the network topology and use just the sequence similarity when forming the seeds. Here, we adopt clustering algorithms to effectively find the highly similar protein groups and then to form the seeds within the groups. Subsequently, the seeds will be expanded and merged by taking into account the network topology to form resulting matchsets.
The overall procedure of our algorithm is illustrated in Figure 1. As can be seen, our method first preprocesses the similarity matrix and then executes following four stages. Stage 1 applies a clustering method to cluster all proteins based on the preprocessed similarity matrix; Stage 2 generates seeds according to the clusters; Stage 3 expands the seeds to conserve edges, and finally, stage 4 aligns remaining proteins. The notations in the algorithms are explained in Table 1.
Preprocessing
The similarity matrix only represents the sequence similarity. If we only identify the seeds based on this similarity matrix, the seeds would not reflect any network topology. Therefore, we adopt a simple preprocessing method to integrate a part of network topology into the similarity matrix. Note that if we consider the whole network topology, the preprocessing would be too timeconsuming (see [9]).
The main idea is that a pair of proteins with similar neighbors should be aligned together rather than a pair of proteins with a close similarity score but without any similar neighbors. Therefore we add an extra score which measures the similarity of two proteins' neighbors to the original similarity score. This extra score of a pair of proteins v_{a }and v_{b}, Net(v_{a}) ≠ Net(v_{b}), is initialized to 0 and this score is calculated in a greedy way: we select a pair of proteins each time that (1) is v_{a}'s neighbor; (2) is v_{b}'s neighbor; (3) both of them have not been selected before; and (4) should have the maximal similarity among all possible pairs satisfying the first three conditions. We add to the extra score and then iteratively select the next pair until v_{a }or v_{b}'s neighbors are all selected, and the resulting extra score is eventually added to the original score. If the extra score is larger than the original score, we set the extra score to the same value of the original score, in order to avoid overemphasizing the neighbors' similarity.
An example is shown in Figure 2. sim(A2, B2) is initially 50. The extra score of (A2, B2) is equivalent to the sum of sim(A1, B1) and sim(A3, B4) However, the extra score of (A2, B3) is 0 since sim(A1, B2) and sim(A3, B2) are both 0. Therefore, the pair (A2, B2) is more favored than (A2, B3) as the matchset (A2, B2) can conserve more edges and obtain higher average similarity score by the subsequent stages than the matchset (A2, B3).
Figure 2. An example with two networks, A and B. The two tables are the similarity scores with and without preprocessing. The solid lines connecting two proteins in the same network are edges, and bold lines are conserved edges. The arrows across two networks are matchsets. The threshold τ is 50 here. Stage 1 shows the clustering result, which is {{A1, B1}, {A2, B2, B3}, {A3, B4}}. Stage 2 generates seed matchsets. Since here are only two networks, we do not merge any matchset and therefore the seed {A2, B3} is ignored. Stage 3 expands the alignment based on the seed {A2, B2}. There are no alignable pairs after stage 3, so stage 4 is not executed in this example.
Algorithm 1 Seed Generation
Input: A set of clusters and the threshold of similarity τ.
Output: A set of seeds Ê.
1: Ê ← Ø;
2: for all C ∈ Č do
3: for all v_{x}, v_{y }∈ C : Net(v_{x}) ≠ Net(v_{y}) do
4: if sim(v_{x}, v_{y}) ≥ τ then
5: Ê ← Ê ∪(v_{x}, v_{y});
Seed generation via clustering
Seeds, which are pairs of proteins with high similarity, can be identified easily by pruning out all pairs of proteins with a similarity score lower than a threshold. However, this approach is not optimized since one protein might be similar to several proteins which are not mutually similar, and therefore the seeds generated by this approach might ruin the quality of matchsets formed by subsequent stages. As the seeds should be generated by globally considering their mutual similarity, we observe that this is equivalent to the clustering problem, in which mutually similar proteins should be clustered together.
We therefore adopt a clustering algorithm to identify the groups of similar proteins and then use Algorithm 1 to identify seeds. Algorithm 1 examines all possible pairs of proteins in each cluster (line 3), and then it uses the threshold parameter τ to determine whether a possible pair is similar enough to be a seed (line 4). Obviously, the lower the threshold is, the more seeds the algorithm generates. Hence, if the threshold is higher, we would generate more matchsets in subsequent stages. Since stage 3 considers network topology more, the resulting alignment should conserve more edges as the threshold becomes higher, but the average similarity score would be lower. If we generate too many clusters, each cluster will be too small, and therefore the resulting seeds cannot effectively span all networks. On the other hand, if there are too few clusters, the number of pairs of proteins we need to evaluate is very large, resulting in poor scalability. In this paper, we select as the number of clusters to achieve the balance between scalability and the span of matchsets, and we exclude proteins with zero similarity to all other proteins before clustering.
The clustering result directly affect the seeds generated by Algorithm 1, so it is very important for our algorithm to choose a clustering method which generates higher similarity score. Since the average similarity score of a matchset would be determined by the similarity scores of all pairs of proteins within a matchset, the clusters, which are used to generate seeds, should have nearly globular shape, i.e., any pair of proteins within a cluster should be reasonably similar. Density clustering methods, such as DBSCAN [26], are not suitable for our problem, since they tend to generate clusters with nonglobular shape. Another characteristic here is that the number of clusters is , and usual clustering methods cannot efficiently generate this huge number of clusters. Kmedoids algorithm suffers from this problem; additionally, because Kmedoids algorithm cannot effectively identify clusters with different sizes, a large group of similar proteins might be split by Kmedoids algorithm. Hence, we choose hierarchical clustering methods with an approximated criterion function, which tend to generate clusters with globular shape and different sizes, for our algorithm. We consider the agglomerative method and the repeatedbisection method implemented by CLUTO (version: 2.1.2) [27]. We use two criterion functions, I1 and I2:
Algorithm 2 Seed expansion on multiple networks
Input: A set of seeds Ê.
Output: A set of matchsets Ŝ
2: S(v_{x}) ← {v_{x}};
3: for all {v_{x}, v_{y}} ∈ Ê do
4: Push ((v_{x}, v_{y}), sim(v_{x}, v_{y})) in pq; //pq is a priority queue.
5: while pq is not empty do
6: (v_{x}, v_{y}) ← pq.pop();
7: Merge(v_{x}, v_{y}); //The boolean output of Algorithm 3 is ignored here.
8: for all matchset S : S ≥ 2 do
9: for all (v_{i}, v_{j}) ∈ S : Net(v_{i}) ≠ Net(v_{j}) do
10: for all (v_{x}, v_{y}): v_{x }∈ N (v_{i}), v_{y }∈ N (v_{j}) do
11: Push ((v_{x}, v_{y}), sim(v_{x}, v_{y})) in pq;
12: while pq is not empty do
13: (v_{i}, v_{j}) ← pq.pop();
14: if Merge(v_{i}, v_{j}) then
15: for all (v_{x}, v_{y}): v_{x }∈ N (v_{i}), v_{y }∈ N (v_{j}) do
16: Push ((v_{x}, v_{y}), sim(v_{x}, v_{y})) in pq;
17: Ŝ ← {matchset S : S ≥ 2};
Seedexpansion strategy
We observe that if a new matchset which consists of the neighbors of an existing matchset is formed, two edges connecting the existing matchset to the new matchset are conserved. Therefore, we start conserving edges by first expanding the seeds, i.e., forming new matchsets consisting of the neighbors of seeds. Since we still want to obtain higher similarity scores during expansion, we adopt a priority queue which contains all expandable pairs of proteins in order to iteratively select the expandable pair with the highest similarity score. Once a new pair is popped from the priority queue and used to form a new matchset, we put the neighbors of the new matchsets into the priority queue in order to expand the alignment. Thereby, each time we pick a pair of proteins and align them together, we conserve two edges and expand the alignment. This method is very efficient to directly conserve a higher amount of edges as we still obtain high similarity scores.
Algorithm 3 Merge
Input: A pair of proteins (v_{i}, v_{j}) and three userdefined parameters β, ρ, and ω.
Output: A boolean value indicating whether the algorithm merges S(v_{i}) and S(v_{j}).
1: if S(v_{i}) + S(v_{j}) >ω _{* }Net(S(v_{i}) ∪ S(v_{j })) then
2: return false;
3: maxSim ← max{sim(v_{x}, v_{y})v_{x}, v_{y }∈ S(v_{i}) ∪ S(v_{j })};
4: count = {(v_{x}, v_{y})v_{x }∈ S(v_{i}), v_{y }∈ S(v_{j}), sim(v_{x}, v_{y}) >maxSim × β};
5: expand Ratio = Net(S(v_{i}) ∪ S(v_{j}))/MAX(Net(S(v_{i})), Net(S(v_{j})));
6: if maxSim > 0 and count _{* }expandRatio ≥ S(v_{i}) × S(v_{j}) × ρ then
7: Merge S(v_{i}) and S(v_{j});
8: v_{i}.ali, v_{j}.ali ← true;
9: return true;
10: else
11: return false;
Figure 2 shows an example, in which there is only one seed {A2, B2}. Among all 6 expandable pairs for that seed, {A1, B1} has the highest similarity score, so we first align A1 to B1 and conserve two edges (A1, A2) and (B1, B2). We then align A3 to B4, and conserve two edges (A2, A3) and (B2, B4). The resulting alignment is {{A1, B1}, {A2, B2}, {A3, B4}}, with 4 conserved edges and the average similarity score 85/3. Note that there are many possible alignments conserving four edges, but this alignment has the maximal average similarity score among them. It is interesting to note that although we can apply the Hungarian algorithm [21], which generates the alignment {{A1, B1}, {A2, B3}, {A3, B4}} that maximizes the average similarity score (95/3), for this pairwise alignment problem, this alignment does not conserve any edge and thus it is unlikely to be biologically meaningful. Furthermore, if we do not preprocess the similarity matrix, the seed is {A2, B3} and the resulting alignment is {{A2, B3}, {A3, B2}, {A1, B1}}, which only conserves two edges and its average similarity score, 80/3, is lower than the score of the alignment with preprocessing.
Merging matchsets
In multiple alignments, each matchset usually consists of more than two proteins, some of which might be in the same network, so some of the seeds generated by Algorithm 1 and the matchsets formed by the seedexpansion strategy should be merged if the proteins in these matchsets are mutually similar enough. We introduce the procedure similar to agglomerative clustering in Algorithm 2 in order to merge matchsets. In agglomerative clustering, each individual object forms a cluster only containing itself at the beginning, and then two clusters are merged into one cluster each round. Here, each protein forms a matchset containing only itself at the beginning (line 12) and we iteratively merge two matchsets into one matchset according to the merging criterion. First, we merge those proteins contained by a seed to form a larger seed which might span more than two networks (line 37). Then, we apply the seedexpansion strategy, i.e., we expand the seeds through aligning the neighbors of the seeds together (line 816). Once a new pair of proteins is selected by the priority queue, we use the merging criterion to examine whether the two matchsets respectively containing these two proteins should be merged or not (line 14). If the merging criterion merges these two matchsets, we put all pairs of their unaligned neighbors in the priority queue in order to conserve more edges.
The merging criterion, Algorithm 3, determines whether two matchsets should be merged. If yes, the algorithm merges these two matchsets and then returns true. If no, it simply returns false. The first criterion is that the merged matchset cannot be larger than a size threshold, which is the number of networks this matchset crosses multiplied by the parameter ω (line 12). The suggested value of ω is [1.5, 2.5]. The intuition of this criterion is that a matchset should not be too large w.r.t. the number of species this matchset crosses; otherwise, the specific mapping of each protein in this matchset is ambiguous. If the merged matchset is equal to or smaller than the threshold, the second criterion is then applied: two matchsets will be merged if and only if expandRatio multiplied by the percentage of the similarity scores of all pairs across these two matchsets being larger than a threshold is larger than ρ (line 6). expandRatio, which is the expanding ratio of the number of species if these two matchsets are merged, is used to increase the possibility of two matchset crossing different networks being merged. The threshold is the product of the parameter β and the maximal similarity score among all pairs of proteins in the union of these two matchsets (line 4). The point of the second criterion is that if there are several nonhighly similar pairs of proteins across two matchsets, we do not merge these two matchsets into a new matchset since the new matchset would tremendously decrease the average similarity score. The two parameters, β and ρ, in Algorithm 3 are both in the range (0, 1]. Generally, if β or ρ is larger, it is harder to satisfy the merging criterion, and thus the average size of the resulting matchsets becomes smaller and the matchsets in the alignment would cross less networks. Note that Algorithm 3 always merges the two matchsets containing only one protein. Additionally, if ρ is larger than 0.5 and there are only 2 networks, the matchsets are limited to contain 2 proteins, which is a usual constraint for pairwise alignments, so our algorithm can be applied to pairwise alignments.
Aligning remaining proteins
Since there are no expandable pairs of proteins after stage 3 (Algorithm 2), we mainly focus on similarity scores in stage 4. The alignable pairs in this stage are all pairs of proteins for which (1) both proteins have not been aligned yet, and (2) the similarity score of the pair is not zero. The second condition is used to prune a large amount of pairs satisfying the first condition. We sort these alignable pairs and then iteratively pick the pair with the highest similarity score, and again we apply the merging criterion at the same time as Algorithm 3 to form matchsets across more than two networks. Note that since the PPI networks are usually wellconnected, the number of remaining proteins for stage 4 is typically a very small portion of all proteins.
Complexity analysis
Let d_{max }be the maximal degree among all proteins. The preprocessing needs to calculate an extra score for each pair of proteins with a similarity score larger than 0. In order to calculate the extra score, we need to sort all possible pairs of neighbors, and therefore the time complexity is . Let be the number of pairs of proteins with nonzero similarity scores. The time complexity of the preprocessing step is Note that, in practice for real data, the similarity matrix is typically a sparse matrix, where .
In stage 1, we adopt the I1 and I2 criterion functions in CLUTO, whose time complexity to cluster all proteins is O(n^{2 }log n). The complexity of stage 2, depends on the number of proteins in each cluster. For a cluster with c proteins, Algorithm 1 examines pairs of proteins. Since the number of clusters is set to , in the worst case, there are clusters, each of which contains only one protein, and one cluster containing proteins, and therefore the time complexity of stage 2 is O(n^{2}). However, if the clusters are balanced, i.e., each cluster consists of average k proteins, Algorithm 1 would only examine pairs of proteins for each cluster, resulting in O(nk^{2}) time complexity.
Let s_{max }be the maximal size of a matchset. Algorithm 3 needs to examine all pairs of proteins from two matchsets; therefore, its time complexity is . Algorithm 2 expands the alignment at most n times and each time it generates at most expandable pairs. Each expandable pair applies Algorithm 3 one time, and we use a priority queue to iteratively select one pair, so the time complexity of Algorithm 2 is . Since the number of interactions is usually linear or loglinear in the number of proteins in PPI networks, and we use parameters ρ and β to limit the matchsets merging, d_{max }and s_{max }are independent of the size of networks, and s_{max }is only proportional to the number of networks k. Thereby, the time complexity of Algorithm 2 can be reduced to O(k^{4}n log(kn)). Eventually, the worst case of stage 4 is to sort all pairs of remaining proteins, so its time complexity is O(n'^{2 }log n'), where n' is the number of unaligned proteins after stage 3. However, n' is usually a very small number for all proteins.
Hence, the total time complexity of our algorithm is O(n^{2 }log n), dominated by the clustering method. Moreover, if several alignments with different tradeoffs are required, our method just needs to preprocess and execute the most timeconsuming stage, clustering, once. All previous network alignment algorithms have to rerun the whole process from scratch [5,8,1113].
Result and discussion
Experiment setup
In this section, we present experimental studies on real datasets. We performed our experiment on three real databases, DOMAIN [18], DIP (version 10/10/2010) [28] and BioGRID (version 3.0.65) [29]. We only extracted proteins with FASTA sequences in order to compute their similarity scores.
Table 2 presents a summary of these datasets. We use Gene Ontology terms [30] as standard functional annotations and exclude the annotation generated by electronic method (IEA). The similarity scores in DIP and BioGRID datasets are BLAST bit scores computed by BLAST package [31]. The raw BLAST bit scores are widely used to measure the similarity between two protein sequences in previous PPI alignment research [5,811,24]. In order to measure how similar two protein sequences are, we use the normalized BLAST bit scores given by [5] instead of the raw BLAST bit scores. The normalized BLAST bit scores are computed as
Table 2. Experimental datasets
The raw BLAST bit scores favor long sequences while the normalized BLAST bit scores, whose values are between 0 and 1, are independent of the sequence length.
The DOMAIN dataset is a subset of an oldversion dataset from DIP (version 10/14/2008). It requires that the sequence of each protein in the DOMAIN dataset should contain at least one domain, where a domain is defined as a FASTA sequence pattern. We then use the information of domains to calculate the similarity score: the similarity score in the DOMAIN dataset is
where D is the set of common domains of v_{x }and v_{y }, and PV is the pvalue calculated by Pfam [25,32]. Note that if two proteins do not have common domains, the similarity score is zero.
The experiments were performed on a dual core machine (Intel core i5 650) with 3.2 GHz of processor speed and 16GB of main memory. We discuss the tradeoff when tuning τ in the discussion of different clustering methods. The parameters (β, ρ, ω) in Algorithm 3 were set to default values (0.1, 0.5, 2.5) respectively, in all experiments unless otherwise noted.
In addition to the number of conserved edges and average similarity score, we used pvalue [33], and the number of enriched GO terms [30] to evaluate functional consistency. The pvalue is computed through the cumulative hypergeometric distribution. Let the total number of proteins of all networks be N with a total number of M proteins annotated with a certain GO term and a matchset containing n proteins in which m proteins are annotated with this GO term. The pvalue of this matchset w.r.t this GO term is
and we select the lowest pvalue among all GO terms as the pvalue of this matchset. Following standard practice, a GO term is considered enriched if the pvalue of one of the matchsets respect to this GO term is less than 10^{4}.
For multiple alignments, most existing algorithms do not align all proteins; in other words, some proteins would not belong to any matchset. Coverage, which is defined as the number of proteins contained by a matchset, can be used as a metric for evaluating multiple alignments. The wider the coverage is, the more comprehensive the alignment is for the networks. Since different alignments have different ranges of coverage, it is not objective to simply compare the number of conserved edges. Another metric that can be used here is conserved edge rate which is defined as number of conserved edges divided by the number of edges whose ending vertices are both covered by the alignment. Although coverage can be optimized by aligning every protein, it is difficult for an alignment with wider coverage to achieve high average similarity score and high conserved edge rate.
Comparison between different clustering methods
Figure 3 shows the qualitative similarity scores and the conserved edge rates of the agglomerative method (agglo) and the repeatedbisection method (rbr) with criterion functions I1 and I2 in DIP dataset. In Figure 3, we show the effect of varying τ. We found that the range [30, 300] gives robust results. For the DIP and DOMAIN datasets, we found that τ ∈ [0.1, 0.5] gives meaningful results. An obvious trend for all methods, is that the conserved edge rate decreases when greater weight is given to sequence similarity. Given a certain average similarity score, the agglomerative method always has higher conserved edge rate than the repeatedbisection method. Since the repeatedbisection method performs m1 bisections to generate m clusters and each bisection just ensures the criterion function is locally optimized, the large number of clusters would adversely affects the global criterion function. In contrast, since the average number of proteins in each cluster is very small (the number of networks k), the agglomerative method is not only faster but also generates more similar clusters than the repeatedbisection method. Moreover, criterion function I1 can achieve larger highest similarity score than criterion function I2 in both the repeatedbisection method and the agglomerative method. Since I1 is similar to the measurement of the average similarity score of each cluster, clusters generated by I1 would result in matchsets with higher similarity scores. Similar results are observed in BioGRID and DOMAIN datasets. Hence, we adopt the agglomerative clustering method with criterion function I1 in the following section.
Figure 3. The tradeoffs in DIP dataset. The average similarity score and the conserved edge rate of different clustering methods (agglo and rbr) and different criterion functions (i1 and i2).
Comparison with IsoRankN
When evaluating the performance, we compare our method with preprocessing against IsoRankN [10]. We empirically find that the matchsets detected by IsoRankN for different parameters is fairly stable. For each experiment, we pick the best of these. Table 3 reveals the comparison between our method and IsoRankN on DOMAIN, DIP, BioGRID datasets. Our method outperforms IsoRankN on these three datasets in terms of coverage, average similarity score, and the number of enriched GO terms while the conserved edge rates of both algorithms are similar. The main reason is that the agglomerative clustering method in our algorithm can effectively identify groups of proteins with mutually similar sequences, while IsoRankN, which adopts spectral partitioning on its score matrix, cannot effectively find these groups. Moreover, the seedexpansion strategy can expand the alignment through the edges connecting to the seeds, and the stage 4 can align remaining proteins with high similarity, so the coverage is larger in our algorithm. It is important to note that since the number of interactions will increase (typically a multiple) with the number of covered proteins, it is difficult to retain the same conserved edge rate as the coverage is enlarged. We especially observe this for the network S. pombe in BioGRID dataset, which is denser than other networks. Overall we find that our method conserves much more edges than IsoRankN, providing a more comprehensive topological mapping.
Table 3. Comparison of quality between our method and IsoRankN
We also observe that the preprocessing can offer an improvement on the number of strictly conserved edges. The numbers of strictly conserved edges on DOMAIN, DIP, BioGRID datasets are 787, 1868, 1808 respectively without preprocessing while the numbers are 1125, 2243, 2216 respectively with preprocessing.
Figure 4 shows the lowest (best) 500 pvalues on these three datasets. Not surprisingly, both algorithms yield slightly lower pvalues in DOMAIN than DIP, since all genes in DOMAIN have at least one sequence pattern so these genes tend to share similar functions. The pvalues in three datasets clearly demonstrate that our method comfortably outperforms IsoRankN in terms of producing functionally relevant results. The main reason is that groups of proteins identified by the agglomerative clustering method can generate the seeds sharing highly similar functions. Subsequently, the seedexpansion method can form new matchsets sharing similar functions by conserving edges connecting to the seeds since two proteins interacting with each other tend to share the same functions. Finally, the merging criterion always preferentially merge two matchsets with a large number of mutually highly similar proteins, so the resulting matchsets potentially conserve biological functions.
Figure 4. The lowest 500 pvalues on each dataset. (a) DOMAIN (b) DIP (c) BioGRID.
Figure 5 displays two of the topranked matchsets from our algorithm on DIP and BioGRID datasets respectively. For the DIP dataset, our method finds the best matchset consisting of 14 proteins from five networks, while IsoRankN does not cover one of them and aligns the remaining 13 proteins into five different matchsets. All 14 proteins have a subsequence highly similar to the domains PF03952.10 (Enolase_N) and PF00113.1 (Enolase_C) and except DIP52012N and DIP6859N, these proteins are all annotated with glycolysis (GO: 0006096). For the BioGRID dataset, our method discovers the matchset consisting of 7 proteins from three networks, but IsoRankN does not cover two of them and aligns the remaining five proteins in all different matchsets. These seven proteins are all associated with the domain PF00105.12 and six of them (except P15370) associated with the domain PF00104.24. Moreover, these seven proteins are all annotated with steroid hormone receptor activity (GO: 0003707). Our point here is that not only do we obtain greater coverage, we also report more functionally relevant groupings and do so at a fraction of the execution cost as we discuss next.
Figure 5. The best matchset (with the lowest pvalue) discovered by our algorithm. Proteins with the same circle line style are in the same matchset formed by IsoRankN. Proteins without any circle are not covered by IsoRankN. The same color presents the same species. (a) DIP (DIP ID) (b) BioGRID (UniprotKB AC).
Our method significantly outperforms IsoRankN in terms of execution time since IsoRankN iteratively updates a huge matrix whose size is proportional to the square of the total number of proteins in all networks. It is important to note here that the dominant cost in our method is the time to preprocess and run the clustering algorithm. These two steps only have to be executed once to generate alignments with different tradeoffs. Keep in mind that IsoRankN has to be rerun from scratch if the tradeoff weight is changed. For DOMAIN, DIP, and BioGRID datasets, the alignment stage only takes 5 to 40 seconds while the preprocessing step and clustering stage totally cost 37, 234 and 175 seconds respectively for our method, and IsoRankN takes 4621, 41719, and 35224 seconds on average respectively. Because IsoRankN does need to to rerun the whole process every time the threshold parameters are changed, as shown in Figure 6, the running time of IsoRankN is proportional to the number of alignments it generates; however, our method, which just needs to reexecute the alignment stages, is about three orders of magnitude faster to execute when generating five alignments. Note that once we need more alignments to be generated, the gap of execution time between our method and IsoRankN becomes larger.
Figure 6. Execution time for generating several alignments.
To summarize, we find that our algorithm provides both a significant qualitative as well as quantitative (efficiency) benefit over the stateoftheart IsoRankN algorithm.
Conclusions
In this article, we present an efficient global PPI network alignment algorithm. Although our approach can also be applied to pairwise alignments, it mainly addresses multiple alignments, which are more comprehensive. In order to efficiently identify functional orthologs across multiple networks, we propose the merging criterion and apply the seedexpansion strategy and clustering techniques to conserve interactions and find similar protein sequences. Results on a number of real datasets highlight the effectiveness, efficiency, and scalability of our algorithm when we compared with the stateoftheart multiple alignment algorithm, IsoRankN. From a qualitative standpoint, our approach also offers a significant advantage over IsoRankN for the multiple alignment problem.
In this work we do not explicitly consider the case of weighted PPI networks. In many cases weights representing the confidence associated with a detected interaction may be available. As part of future work, we would like to investigate the GNA problem with weighted interaction representing the evidence of interaction existence. A promising direction here is to develop a mechanism to integrate the similarity score with the interaction confidence. Another possible strategy is to adopt stateoftheart graph clustering algorithms, such as [33,34], first and then use the sequence similarity to align similar clusters together.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
YKS developed software, conducted experiments, and drafted the manuscript. SP suggested the initial idea which was subsequently refined by both authors, guided the development and analysis of the method, helped design the experiments and cowrote the manuscript. Both authors read and approved the final manuscript.
Acknowledgements
We thank Tyler Clemons, Venu Satuluri, Mikhail Zaslavskiy, ChungShou Liao, and Xin Guo for providing experimental data and helpful suggestions for improving this work. This work is supported in part by the following NSF grants: CCF0702587, IIS0917070 and IIS 1141828. The work described herein reflects the opinions of the authors and does not necessarily reflect the opinions of the NSF.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 3, 2012: ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/13/S3.
References

Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive twohybrid analysis to explore the yeast protein interactome.
Proc Natl Acad Sci USA 2001, 98(8):45694574. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mann M, Aebersold R: Mass spectrometrybased proteomics.
Nature 2003, 422:198207. PubMed Abstract  Publisher Full Text

Sharan R, Suthram S, Kelley RM, Kuhn T, McCuine S, Uetz P, Sittler T, Karp RM, Ideker T: Conserved patterns of protein interaction in multiple species.
Proc Natl Acad Sci USA 2005, 102(6):1974. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bandyopadhyay S, Sharan R, Ideker T: Systematic identification of functional orthologs based on protein network comparison.
Genome Res 2006, 16(3):428435. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chindelevitch L, Liao CS, Berger B: Local optimization for global alignment of protein interaction networks.
Pac Symp Biocomput 2010, 123132. PubMed Abstract

LavalléeAdam M, Coulombe B, Blanchette M: Detection of locally overrepresented GO terms in proteinprotein interaction networks. In Research in Computational Molecular Biology. Springer; 2009:302320.

Li ZP, Zhang SH, Wang Y, Zhang XS, Chen L: Alignment of molecular networks by integer quadratic programming.
Bioinformatics 2007, 23(13):16311639. PubMed Abstract  Publisher Full Text

Zaslavskiy M, Bach F, Vert JP: Global alignment of proteinprotein interaction networks by graph matching methods.
Bioinformatics 2009, 25(12):i259i267. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Singh R, Xu J, Berger B: Pairwise global alignment of protein interaction networks by matching neighborhood topology. In Research in Computational Molecular Biology. Springer; 2007:1631.

Liao CS, Lu K, Baym M, Singh R, Berger B: IsoRankN: spectral methods for global alignment of multiple protein networks.
Bioinformatics 2009, 25(12):i253i258. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zaslavskiy M, Bach F, Vert JP: A path following algorithm for the graph matching problem.
IEEE Trans Pattern Anal Mach Intell 2009, 31(12):22272242. PubMed Abstract  Publisher Full Text

Klau G: A new graphbased method for pairwise global network alignment.
BMC Bioinformatics 2009, 10(Suppl 1):S59. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Bayati M, Gerritsen M, Gleich DF, Saberi A, Wang Y: Algorithms for large, sparse network alignment problems. In 2009 Ninth IEEE International Conference on Data Mining. IEEE; 2009:705710.

Kalaev M, Bafna V, Sharan R: Fast and accurate alignment of multiple protein networks.
J Comput Biol 2009, 16(8):989999. PubMed Abstract  Publisher Full Text

Mehta S, Hazzard K, Machiraju R, Parthasarathy S, Wilkins J: Detection and visualization of anomalous structures in molecular dynamics simulation data. In Proceedings of the Conference on Visualization '04. IEEE Computer Society; 2004:465472.

Koyutürk M, Kim Y, Topkara U, Subramaniam S, Szpankowski W, Grama A: Pairwise alignment of protein interaction networks.
J Comput Biol 2006, 13(2):182199. PubMed Abstract  Publisher Full Text

Flannick J, Novak A, Srinivasan BS, McAdam HH, Batzoglou S: Græmlin: general and robust alignment of multiple large interaction networks.
Genome Res 2006, 16(9):11691181. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Guo X, Hartemink AJ: Domainoriented edgebased alignment of protein interaction networks.
Bioinformatics 2009, 25(12):i240i246. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kelley BP, Sharan R, Karp RM, Sittler T, Root DE, Stockwell BR, Ideker T: Conserved pathways within bacteria and yeast as revealed by global protein network alignment.
Proc Natl Acad Sci USA 2003, 100:1139411399. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kelley BP, Yuan B, Lewitter F, Sharan R, Stockwell BR, Ideker T: PathBLAST: a tool for alignment of protein interaction networks.
Nucleic Acids Res 2004, 32(Web Server issue):W83W88. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kuhn HW: The Hungarian method for the assignment problem.
Naval Research Logistics Quarterly 1955, 2:8397. Publisher Full Text

Flannick J, Novak AF, Do CB, Srinivasan BS, Batzoglou S: Automatic parameter learning for multiple network alignment.

Singh R, Xu J, Berger B: Global alignment of multiple protein interaction networks with application to functional orthology detection.
Proc Natl Acad Sci USA 2008, 105(35):1276312768. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Finn R, Mistry J, Tate J, Coggill P, Heger A, Pollington J, Gavin O, Gunasekaran P, Ceric G, Forslund K, et al.: The Pfam protein families database.
Nucleic Acids Res 2010, 38(Suppl 1):D211. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ester M, Kriegel HP, Sander J, Xu X: A densitybased algorithm for discovering clusters in large spatial databases with noise.

CLUTO  Family of Data Clustering Software Tools [http://glaros.dtc.umn.edu/gkhome/views/cluto] webcite

Xenarios I, Salwínski L, Duan XJ, Higney P, Kim SM, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions.
Nucleic Acids Res 2002, 30:303305. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Breitkreutz BJ, Stark C, Reguly T, Boucher L, Breitkreutz A, Livstone M, Oughtred R, Lackner DH, Bähler J, Wood V: The BioGRID interaction database: 2008 update.
Nucleic Acids Res 2008, 36(Suppl 1):D637. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT: Gene ontology: tool for the unification of biology.
Nat Genet 2000, 25:2529. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

NCBI Learning Center [http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/] webcite
[Accessed September 23, 2011]

Search Pfam [http://pfam.sanger.ac.uk/search] webcite
[Accessed October 26, 2011]

Asur S, Ucar D, Parthasarathy S: An ensemble framework for clustering proteinprotein interaction networks.
Bioinformatics 2007, 23(13):i29i40. PubMed Abstract  Publisher Full Text

Satuluri V, Parthasarathy S, Ucar D: Markov clustering of protein interaction networks with improved balance and scalability. In ACMBCB. ACM; 2010:247256.