Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE, 68198, USA

Department of Computer Science, State University of New York at Albany, 1400 Washington Ave., Albany, NY 12222, USA

Department of Chemistry and Biochemistry, University of Northern Iowa, Cedar Falls, IA 50614, USA

Bioinformatics and Systems Biology Core Facility, University of Nebraska Medical Center, Omaha, NE, 68198, USA

Abstract

Background

Protein-protein interaction (PPI) networks carry vital information about proteins' functions. Analysis of PPI networks associated with specific disease systems, including cancer, helps us understand the complex biology of diseases. Specifically, identifying similar and frequently occurring patterns (network motifs) across PPI networks provides useful clues for understanding the biology of these diseases.

Results

In this study, we developed a novel pattern-mining algorithm that detects cancer-associated functional subgraphs occurring in multiple cancer PPI networks. We constructed nine cancer PPI networks using differentially expressed genes from the Oncomine dataset. From these networks we discovered frequent patterns that occur in all networks and at different size levels. Patterns are abstracted subgraphs with their nodes replaced by node cluster IDs. By using effective canonical labeling and adopting weighted adjacency matrices, we are able to perform the graph isomorphism test in polynomial running time. We use a bottom-up pattern growth approach to search for patterns, which allows us to effectively reduce the search space as pattern sizes grow. Validation of the frequent common patterns using GO semantic similarity showed that the discovered subgraphs scored consistently higher than randomly generated subgraphs at each size level. We further investigated the cancer relevance of a select set of subgraphs using literature-based evidence.

Conclusion

Frequent common patterns exist in cancer PPI networks, which can be found through effective pattern mining algorithms. We believe that this work would allow us to identify functionally relevant and coherent subgraphs in cancer networks, which can be advanced to experimental validation to further our understanding of the complex biology of cancer.

Background

Protein-protein interaction (PPI) networks carry vital information on the molecular functions and biological processes of cells. Analysis of PPI networks associated with specific disease systems including cancer helps us to better understand the complex biology of diseases. PPI networks are dynamically modulated in a tissue-specific microenvironment; hence, a set of similarly expressed genes from two types of cancer tumors may exhibit different PPI patterns. A lot of gene expression data has been accumulated on cancer-specific tumors warranting the need for developing effective algorithms to translate the differentially expressed gene lists into functionally coherent modules that are common to all cancers or shared in a given subset of cancers. To achieve this, genes are mapped to corresponding proteins and known PPIs are represented as a network graph for further analysis. Using graph theory-based algorithms, pairs of networks can be compared to identify common, distinct or frequent sub-networks. These sub-networks containing a set of proteins (nodes) with a distinct set of connections (edges) can represent a functional unit in a pathway or in a biological process. Similarly, frequent sub-networks (network motifs) may represent recurring functional units within a network or among multiple networks. In this study, we focus on developing a graph-based algorithm to identify common and frequent network motifs from PPI networks of nine different cancers.

Graphs have been widely used to model a variety of data types such as PPI networks

Existing methods for graph comparison can be categorized into the following three major types: distance-based, alignment-based and kernel-based methods. In a distance-based method, similarity of graphs is measured based on the graphs' common structures, such as the maximum common subgraph (mcs). One such distance is d(G_{1}, G_{2}) = 1 - |V_{mcs}| / max{|V_{1}|, |V_{2}|}, where |V| is the number of nodes in graph G = (V, E)

The alignment-based methods utilize the idea of graph alignment that is conceptually similar to sequence alignment. In sequence alignment, different scores or penalties are assigned for matches, mismatches and gaps, and the alignment algorithm looks for the best way to arrange the sequences so that the overall alignment score is maximized. In graph alignment, the similarities of graphs are determined by the conservation of interactions, which is measured through the edges and similarity of nodes

The third approach, using kernel-based methods, measures graph similarities through kernel functions. Existing graph kernels can be viewed as a special case of R-convolution kernels proposed by Haussler

One of the most important tasks in the analysis of PPI networks is to predict functional modules that represent either stable protein complexes or groups of transiently interacting proteins that together can accomplish a biological function. These functional modules can be mapped to specific subgraphs in PPI networks. Below, we discuss three methods that have been used to extract substructures from graphs: (i) frequent subgraph identification, (ii) graph segmentation and (iii) core-based clustering.

The graph segmentation method extracts substructures by partitioning graphs into disjoint dense subgraphs. K-means clustering

In contrast to the graph segmentation method, where the central nodes of the subgraphs are usually randomly chosen, in core-based clustering the central nodes are selected before clustering is performed

Due to the NP-hardness of many graph problems, most of the previous methods offer approximate solutions to measure graph similarity. In this paper we present a method that produces exact solutions in graph comparison and pattern identification. Our algorithm works in a bottom-up fashion. It starts from one-node subgraphs and proceeds to one-edge and multiple-edge subgraphs. In each iteration the search space is reduced by eliminating parts of the networks that are not eligible for the next round of comparison. Even though the run-time increases exponentially as the size of the subgraph increases, in our case the size of the search space, which is the base of the exponential, shrinks quickly. Therefore we can obtain the complete result in a reasonable amount of time. As we look for common substructures across the networks, we also perform a graph isomorphism test. The graph isomorphism problem is known to be in NP; however, it is not known whether it is in P or is NP-complete. In our specific context of network comparison, we solve it in polynomial time with our pattern-labeling algorithm.

We applied our algorithm on nine cancer associated PPI networks to identify common and frequent patterns in these networks. We collected differentially expressed genes from microarray studies of various solid tumor tissues derived from the Oncomine database

Results and discussion

Cancer protein interaction networks

Our PPI networks are constructed from a comprehensive, non-redundant dataset of experimentally-derived PPIs

We collected differentially expressed genes (DEGs) between tumor and normal samples from microarray studies of nine different solid-tumor cancer types using the Oncomine database

Number of genes and proteins mapped under each cancer network.

| Cancer type | Number of genes | Number of proteins | Edge count | Node count |
|---|---|---|---|---|
| Bladder cancer | 11771 | 29286 | 47909 | 10726 |
| Breast cancer | 11373 | 26498 | 33558 | 8611 |
| Cervical cancer | 9811 | 22447 | 19332 | 6288 |
| Colorectal cancer | 18982 | 40905 | 58212 | 13273 |
| Esophagus cancer | 5135 | 13380 | 13405 | 4218 |
| Gastric cancer | 12137 | 28224 | 41289 | 9707 |
| Melanoma | 8763 | 22421 | 30843 | 7677 |
| Pancreatic cancer | 17339 | 37160 | 52125 | 12199 |
| Prostate cancer | 11181 | 27598 | 41658 | 9621 |

Similar to many PPI networks, cancer PPI networks also exhibit power-law degree distributions (Figure

**Power-law distribution of PPI networks from nine different cancers**.
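A degree distribution of the kind summarized in this figure can be tabulated directly from an edge list. The following is a minimal sketch, not the authors' code: the function names and the toy network are hypothetical, and the log-log least-squares slope is only a crude indicator of power-law behaviour rather than a rigorous fit.

```python
import math
from collections import Counter

def degree_distribution(edges):
    """Map each degree k to the number of nodes with exactly k neighbours."""
    neighbours = {}
    for u, v in edges:
        if u == v:
            continue  # self-loops are ignored
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    return Counter(len(nbrs) for nbrs in neighbours.values())

def loglog_slope(distribution):
    """Least-squares slope of log(node count) against log(degree)."""
    points = [(math.log(k), math.log(c))
              for k, c in distribution.items() if k > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

# Hypothetical toy network: one hub with three neighbours.
toy_edges = [("HUB", "P1"), ("HUB", "P2"), ("HUB", "P3")]
toy_distribution = degree_distribution(toy_edges)
```

A strongly negative slope on real data would be consistent with, but does not by itself establish, a power-law degree distribution.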

Network analysis

We are interested in frequent patterns because the presence of these subgraphs in PPI networks is analogous to the presence of motifs in a multiple sequence alignment. These frequent subgraphs represent conserved functional modules that play significant roles in the disease systems we study. First we look for frequent subgraphs within a network, because more than one identical subgraph can arise from nodes that belong to the same cluster (see below). Then we perform comparative analysis across multiple networks to measure the commonality across networks. These subgraphs must be connected components, which is a prerequisite for forming protein complexes or pathways. Our method of frequent pattern extraction involves the following three key steps: identification of node similarity, graph isomorphism test and discovery of frequent patterns.

Identification of node similarity

Each node in a PPI network represents a unique protein. Nodes are considered similar if the proteins they represent have similar functions. We use the sequence alignment algorithm Blastclust

Graph isomorphism test

The basic idea in canonical graph labeling

Figure

**Canonical labeling of subgraph structures**. **2A:** The columns of the adjacency matrix are arranged according to the natural order of node labels. As this is a complete graph, there are edges between every pair of distinct nodes; therefore all non-diagonal elements are 1. Since there is no self-loop in the graph, the diagonal elements are all 0. The canonical label [V1, V2, V3, V4]0111011010 is formed of two parts. The first part, [V1, V2, V3, V4], is the concatenation of node labels, delimited by commas. The second part, 0111011010, is the concatenation of the upper triangle of the adjacency matrix, diagonal included. The two parts are separated by the closing square bracket. **2B:** Three of the nodes have the same cluster ID, which results in three possible adjacency matrices.
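The label construction described in panel 2A can be sketched in a few lines. This is an illustrative reading of the caption, not the authors' implementation: the function name `canonical_label` is hypothetical, node labels are assumed to be unique, and "natural order" is taken here to mean plain lexicographic sorting.

```python
def canonical_label(node_labels, edges):
    """Build '[labels]bits': sorted node labels followed by the upper
    triangle of the adjacency matrix (diagonal included), row by row."""
    order = sorted(node_labels)  # assumption: lexicographic order of labels
    index = {label: i for i, label in enumerate(order)}
    n = len(order)
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        i, j = index[u], index[v]
        adj[i][j] = adj[j][i] = 1  # undirected edge
    bits = "".join(str(adj[i][j]) for i in range(n) for j in range(i, n))
    return "[" + ", ".join(order) + "]" + bits

# The complete graph on four nodes from panel 2A.
nodes = ["V1", "V2", "V3", "V4"]
complete_edges = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
label = canonical_label(nodes, complete_edges)
print(label)  # [V1, V2, V3, V4]0111011010
```

Because the nodes are sorted before the matrix is built, the order in which nodes and edges are supplied does not affect the result when all labels are distinct. The case in panel 2B, where several nodes share a cluster ID, needs the additional tie-breaking described in the text.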

PageRank algorithm

In Figure

**Computing the weighted adjacency matrix**.

From the adjacency matrix, we can compute the hyperlink matrix, denoted as H.

The hyperlink matrix generated from the above example is

The hyperlink matrix is a stochastic matrix: every column of H sums to 1. The entry H[i, j] indicates the probability of moving from node j to node i. It can also be understood as the share of node j's total contribution, across all the nodes j is connected to, that goes to node i. Let v be the vector storing the relative importance of nodes, where v[i] denotes the relative importance of node i. A node's relative importance is determined by the contributions all other nodes make to it, so we need to solve the equation Hv = v. This amounts to finding the eigenvector of H corresponding to eigenvalue 1. Eigenvalue computation can be performed in polynomial time.
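As a rough illustration of the equation Hv = v, the sketch below normalizes the columns of an adjacency matrix and iterates to the fixed point. Several things here are assumptions rather than the paper's procedure: the four-node graph is a hypothetical example, the function names are invented, and a "lazy" power iteration (averaging v with Hv) is swapped in for an explicit eigendecomposition because it has the same fixed point and also converges on bipartite graphs.

```python
def hyperlink_matrix(adj):
    """Column-normalize an adjacency matrix so that each column sums to 1.
    Assumes every node has at least one neighbour; an isolated node would
    give an all-zero column and H would no longer be stochastic."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0
             for j in range(n)] for i in range(n)]

def relative_importance(h, iterations=200):
    """Approximate the solution of Hv = v with v summing to 1."""
    n = len(h)
    v = [1.0 / n] * n
    for _ in range(iterations):
        hv = [sum(h[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [(v[i] + hv[i]) / 2.0 for i in range(n)]  # lazy update
    return v

# Hypothetical graph, node order [A1, A2, B1, B2]:
# edges A1-B1, A2-B2 and B1-B2.
adj = [
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
H = hyperlink_matrix(adj)
v = relative_importance(H)
```

In this toy graph A1 and A2 receive the same importance, as do B1 and B2, so each pair would fall into one equivalence class. The same code accepts non-negative edge weights in place of the 0/1 entries.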

It shows that A1 and A2 are of the same relative importance. They will be included in the same equivalence class. B1 and B2 will also be included in the same equivalence class. Then we sort nodes based on cluster ID at first level and equivalence class at second level. In matrix M when we shuffle nodes in the same equivalence class, the matrix content will not be changed; the canonical label remains the same. Therefore permutations are not needed to generate a unique pattern label.
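The two-level sort can be sketched as follows. This is a hedged illustration with hypothetical names and toy graphs: importance is computed as degree divided by twice the edge count, which is the closed-form solution of Hv = v for an unweighted, connected, undirected graph, and ascending order within a cluster is an arbitrary choice made here.

```python
def pattern_label(cluster_ids, edges, digits=6):
    """cluster_ids: node -> cluster ID; edges: iterable of node pairs.
    Nodes are sorted by cluster ID, then by relative importance."""
    edges = list(edges)
    degree = {node: 0 for node in cluster_ids}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Closed form of Hv = v for an unweighted undirected graph (assumption).
    importance = {n: round(degree[n] / (2.0 * len(edges)), digits)
                  for n in cluster_ids}
    order = sorted(cluster_ids, key=lambda n: (cluster_ids[n], importance[n]))
    index = {node: i for i, node in enumerate(order)}
    size = len(order)
    adj = [[0] * size for _ in range(size)]
    for u, v in edges:
        adj[index[u]][index[v]] = adj[index[v]][index[u]] = 1
    bits = "".join(str(adj[i][j]) for i in range(size) for j in range(i, size))
    return "[" + ", ".join(cluster_ids[n] for n in order) + "]" + bits

# Two hypothetical embeddings of one pattern, with different protein names.
g1_clusters = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
g1_edges = [("p1", "p3"), ("p1", "p4"), ("p2", "p3"), ("p2", "p4"), ("p3", "p4")]
g2_clusters = {"q1": "B", "q4": "A", "q2": "B", "q3": "A"}
g2_edges = [("q2", "q1"), ("q3", "q1"), ("q4", "q2"), ("q3", "q2"), ("q4", "q1")]

label_1 = pattern_label(g1_clusters, g1_edges)
label_2 = pattern_label(g2_clusters, g2_edges)
```

Both embeddings receive the same label because nodes within each equivalence class are interchangeable in this example. The sketch does not attempt to resolve ties between nodes that share a cluster ID and an importance value but are not structurally interchangeable.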

In Figure

Using the algorithm described above we can generate pattern labels for graphs. Generally it takes O(n^{3}) time to compute an eigenvalue decomposition. Constructing the adjacency matrix and the hyperlink matrix each takes O(n^{2}) time. Sorting the nodes takes O(n lg n) time. Thus the algorithm to compute pattern labels runs in polynomial time.

Discovery of frequent patterns

Finding frequent subgraphs is an NP-hard problem. When the size of the subgraph is variable, finding frequent subgraphs takes exponential run-time. Therefore, to solve the frequent subgraph problem we need to effectively reduce the search space as subgraph size increases. To accomplish this, we take a bottom-up approach, finding small subgraphs first and then proceeding to larger ones. We start with frequent subgraphs of 1 node. In each network we look for clusters whose size is no less than the given threshold. This can be done by simply counting the nodes within each cluster in each network. Among the selected clusters, we keep those present in all networks. Nodes belonging to these clusters are kept; the rest are removed from the networks, together with the edges incident to them. On the remaining part of the networks we then discover patterns of the next size level.
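The 1-node step can be sketched as a counting and pruning pass. This is an illustrative sketch under assumed data structures (each network as a pair of a node set and an edge list, plus a node-to-cluster mapping); the function name and the toy data are hypothetical.

```python
from collections import Counter

def prune_to_common_clusters(networks, cluster_of, threshold):
    """networks: list of (nodes, edges); cluster_of: node -> cluster ID.
    Returns the clusters frequent in every network and the pruned networks."""
    frequent_per_network = []
    for nodes, _ in networks:
        counts = Counter(cluster_of[n] for n in nodes)
        frequent_per_network.append(
            {c for c, k in counts.items() if k >= threshold})
    common = (set.intersection(*frequent_per_network)
              if frequent_per_network else set())
    pruned = []
    for nodes, edges in networks:
        kept = {n for n in nodes if cluster_of[n] in common}
        # Drop every edge incident to a removed node.
        kept_edges = [(u, v) for u, v in edges if u in kept and v in kept]
        pruned.append((kept, kept_edges))
    return common, pruned

# Hypothetical toy data: two networks and a threshold of 2.
cluster_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
net1 = ({"a1", "a2", "b1", "b2", "c1"},
        [("a1", "b1"), ("a2", "c1"), ("b1", "b2")])
net2 = ({"a1", "a2", "b1"}, [("a1", "a2"), ("a1", "b1")])
common, pruned = prune_to_common_clusters([net1, net2], cluster_of, 2)
```

Here cluster B reaches the threshold only in the first network, so only cluster A survives, and the edges touching B or C nodes disappear before the next size level is searched.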

Frequency downward closure is an important property that most of the frequent-subgraph-finding algorithms are based on. It is essential for the computational tractability of most frequent subgraph discovery algorithms

**Graph showing the number of identified patterns versus pattern size**.

**List of 2-node subgraphs**.

**List of 3-node subgraphs**.

**List of 4-node subgraphs**.

**List of 5-node subgraphs**.

**List of 6-node subgraphs**.

**List of 7-node subgraphs**.

**List of 8-node subgraphs**.

**List of 9-node subgraphs**.

**List of 10-node subgraphs**.

Figure

Each of the patterns listed in Figure

**Multiple subgraphs of the MYC pattern that vary by nodes of the same cluster at an equivalent position**. The 4 subgraphs have similar nodes (TUBA4A, TUBA8, TUBA1B and TUBA1A) at corresponding positions. Therefore they belong to the same pattern.

Performance validation

We compared our method with FSG, which is a frequent subgraph-mining algorithm

The subgraph patterns identified by us are frequent within each network and also common to all the nine cancer networks. Hence, we hypothesize that each subgraph corresponds to an important functional module in cancer. We used GO semantic similarity

To test this hypothesis, we compared sets of randomly generated subgraphs (SG_{Rand}) against the sets identified by our algorithm (SG_{Cancer}). We generated random sets of 1000 subgraphs for each edge-group of size n (n = 4-10) from the human PPI network. In other words, both the SG_{Rand} and SG_{Cancer} sets of subgraphs are derived from the same parent interactome, but they differ in the node and edge topologies they contain. We computed the average semantic similarity scores of the SG_{Rand} and SG_{Cancer} subgraphs for each edge-group. The results of the comparison are shown in the corresponding figure. The average scores of the SG_{Cancer} subgraphs are substantially higher than those of the SG_{Rand} subgraphs at all edge-group levels tested. This result indicates that the SG_{Cancer} subgraphs identified by our algorithm are functionally coherent modules. Still, the question remains as to what role they play in cancer. To address this, we further studied a select set of subgraphs from different edge-groups to understand their role in different cancers.
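One way to build such a random baseline is sketched below. The text does not spell out the sampling procedure, so everything here is an assumption: subgraphs are grown one edge at a time from a random seed edge, a subgraph is scored by averaging a per-interaction score over its edges, and the toy network and scores are hypothetical.

```python
import random

def random_connected_subgraph(edges, n_edges, rng):
    """Grow a connected subgraph with n_edges edges by repeatedly adding a
    random edge that touches the nodes chosen so far. Returns None if the
    seed edge's component has fewer than n_edges edges."""
    edges = [tuple(e) for e in edges]
    first = rng.choice(edges)
    chosen = [first]
    nodes = set(first)
    while len(chosen) < n_edges:
        frontier = [e for e in edges
                    if e not in chosen and (e[0] in nodes or e[1] in nodes)]
        if not frontier:
            return None
        nxt = rng.choice(frontier)
        chosen.append(nxt)
        nodes.update(nxt)
    return chosen

def mean_edge_score(subgraph, score):
    """Average a per-interaction score over the edges of a subgraph."""
    return sum(score[frozenset(e)] for e in subgraph) / len(subgraph)

# Hypothetical toy interactome with made-up similarity scores.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("a", "c")]
toy_score = {frozenset(e): s
             for e, s in zip(toy_edges, [1.0, 2.0, 3.0, 4.0, 5.0])}
rng = random.Random(0)
sample = random_connected_subgraph(toy_edges, 3, rng)
sample_mean = mean_edge_score(sample, toy_score)
```

Repeating the sampling 1000 times per edge-group and averaging `mean_edge_score` over the samples would give a baseline of the kind described above.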

**Validation of the prediction performance using GO semantic similarity scores**. The purple line represents average GO scores of cancer subgraphs and the blue line represents those of randomly generated subgraphs, at each edge-group level.

Role of subgraph patterns in cancer

The 10-edge subgraph primarily consists of the glucocorticoid receptor (NR3C1), three of its coactivators (CREBBP, NCOA1, and NCOA3) and one co-repressor (NCOR2). In addition, there are three transcriptional regulators (STAT3, STAT5A and RELA) and an RNA binding motif protein (RBM8A). All the known direct and indirect interactions among these proteins are shown in Figure

**Ingenuity pathway analysis of the 10-edge pattern subgraph showing cancer-associated interactions among its nodes**. The edges represent both physical (direct) and regulatory (indirect) relationships.

We also looked at some of the smaller subgraphs containing 2-8 edges and found a number of network patterns associated with cytoskeletal functions. One of the 8-edge patterns is related to a functional unit consisting of actin (α, β and γ isoforms) and six actin associated genes, ACTR1A, CCT5, GSN, SPTAN1, TPM1, DYNLL1 and their homologs, that are differentially expressed across nine cancer types. CCT5 is a molecular chaperone, and is part of the TCP1 ring complex, known to fold various proteins including actin and tubulin. We find that CCT5 is uniformly up-regulated across datasets. We hypothesize that CCT5 may play an important role in ensuring the correct folding of cytoskeletal proteins that are produced during cell proliferation in cancer. It is well known that the actin cytoskeleton is substantially modified in transformed cells, and this occurs in concert with changes in a host of actin filament-associated regulatory proteins

In the 5-edge group of patterns, we have identified a functional module centered on the well-known oncogene MYC, and Myc binding proteins, Max, Mycbp2 (PAM), and SP1, that are differentially regulated in nine cancers. Interestingly, this functional pattern also includes α and β tubulins and their homologs in various subgraphs as shown in Figure

Conclusion

In this paper, we present a novel algorithm for mining frequent and common patterns across multiple cancer PPI networks. The comprehensive PPI datasets used in this study exhibit power-law degree distributions across all cancer networks. By using effective canonical labeling and adopting weighted adjacency matrices, we are able to perform the graph isomorphism test in polynomial running time. The search starts from small patterns of 1 node, proceeds by incrementing the subgraph size 1 edge at a time, and stops when no frequent patterns are discovered at a given edge level. As the size increments, the infrequent edges in the original networks are removed, thus reducing the search space for the next round of searching. We applied the algorithm to nine cancer PPI networks and identified frequent and common patterns of different sizes up to 10 edges. To validate the performance of our method, we compared these patterns against randomly generated patterns at each edge-group, using the GO semantic similarity measure. Patterns identified in this study exhibited substantially higher scores than the random ones at all edge-group levels, indicating that these patterns are functionally cohesive modules. Further investigation of the specific role of each module in cancer revealed their association with various cancer-associated processes such as transcriptional regulation, cell growth and cell proliferation. Ingenuity pathway analysis of a 10-edge module demonstrated that the cancer-associated functions are tightly interdependent among the nodes of the subgraph, as evidenced by both direct and indirect interactions. Based on these results, we believe that the methodology developed in this study is capable of identifying common and frequent subgraphs from large and multiple interaction networks. While we used cancer PPI networks in our study, this is a generic methodology and hence can be applied to mine subgraphs from many other networks.

Methods

Human protein interactome dataset

We created a comprehensive, non-redundant dataset of experimentally-derived interacting proteins by combining multiple datasets (downloaded in the PSI MI 2.5 format) from five major protein interaction databases that include DIP (Database of Interacting Proteins)

Calculation of GO semantic similarity

The semantic similarity of GO terms between two interacting proteins was calculated for all possible pairs of proteins in the human PPI network. The GO terms associated with each protein were obtained from the GO database. The GO annotation (GOA) for a protein can be based on three concepts i.e., biological process (P), molecular function (F) and cellular component (C). The best semantic similarity measure between the GO terms of the two proteins, under each GO concept, was determined for all pairs of proteins using the method proposed by Brown and Jurisica

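Semantic similarity is based on the probability of the minimum subsumer, p_{ms}, of two GO terms g_{1} and g_{2}. Let S(g_{1}, g_{2}) be the set of parent terms shared by g_{1} and g_{2}, and let p(g) be the probability (relative frequency) of term g; then p_{ms}(g_{1}, g_{2}) is the minimum of p(g) over all g in S(g_{1}, g_{2}).

A similarity measure based on this probability is then calculated as the negative log probability of the minimum subsumer:

sim(g_{1}, g_{2}) = -log p_{ms}(g_{1}, g_{2})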

In brief, the similarity score between two GO terms is higher if they share a common parent with a more specific GO term (less frequent), and vice versa. The total similarity score is the sum of the best similarity scores from each concept.
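The scoring described above can be sketched as follows. This is a hedged illustration, not the published implementation: the minimum subsumer probability is assumed to be the smallest term probability among the ancestors shared by the two terms (each term counted as its own ancestor), the natural logarithm is an arbitrary choice of base, and the tiny ontology, its term names and its probabilities are all hypothetical.

```python
import math

def minimum_subsumer_probability(g1, g2, ancestors, prob):
    """Smallest probability among terms that subsume both g1 and g2.
    Assumes the two terms share at least one ancestor (e.g. the root)."""
    shared = (ancestors[g1] | {g1}) & (ancestors[g2] | {g2})
    return min(prob[g] for g in shared)

def term_similarity(g1, g2, ancestors, prob):
    """Negative log probability of the minimum subsumer."""
    return -math.log(minimum_subsumer_probability(g1, g2, ancestors, prob))

def total_similarity(annot_a, annot_b, ancestors, prob):
    """Sum, over the three GO concepts, of the best term-pair similarity.
    annot_*: concept ('P', 'F' or 'C') -> set of terms for one protein."""
    total = 0.0
    for concept in ("P", "F", "C"):
        terms_a = annot_a.get(concept, set())
        terms_b = annot_b.get(concept, set())
        if terms_a and terms_b:
            total += max(term_similarity(a, b, ancestors, prob)
                         for a in terms_a for b in terms_b)
    return total

# Hypothetical molecular-function fragment: root -> binding -> {dna, rna}.
ancestors = {
    "root": set(),
    "binding": {"root"},
    "dna_binding": {"binding", "root"},
    "rna_binding": {"binding", "root"},
}
prob = {"root": 1.0, "binding": 0.5, "dna_binding": 0.1, "rna_binding": 0.2}

protein_a = {"F": {"dna_binding"}}
protein_b = {"F": {"rna_binding", "binding"}}
score = total_similarity(protein_a, protein_b, ancestors, prob)
```

Two proteins that share only the frequent parent `binding` score -log 0.5, whereas two proteins both annotated with the rarer `dna_binding` would score -log 0.1, matching the statement that a more specific shared parent gives a higher score.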

Graph theory preliminaries

Definition 1 (Labeled graph) A labeled graph is a triple G = (V, E, μ), where

• V is the node set

• E is the edge set, E ⊆ V × V

• μ:V → L_{V }is a function assigning labels to nodes

In PPI networks, nodes are labeled with protein IDs. Since each protein appears at most once in a PPI network, no two nodes share the same label. Formally: ∀ v_{i}, v_{j} ∈ V, v_{i} ≠ v_{j} → μ(v_{i}) ≠ μ(v_{j}).

Definition 2 (Undirected graph, connected graph) A graph G = (V, E, μ) is an undirected graph if and only if ∀v_{i}, v_{j} ∈ V: (v_{i}, v_{j}) ∈ E ↔ (v_{j}, v_{i}) ∈ E. In an undirected graph G, two nodes v_{i} and v_{j} are connected if G contains a path from v_{i} to v_{j}. A graph is said to be connected if every pair of nodes in the graph is connected.

Definition 3 (Subgraph) Graph G' = (V', E', μ') is a subgraph of graph G = (V, E, μ) if V' ⊆ V, E' ⊆ (V' × V') ∩ E, and μ'(v) = μ(v) for all v ∈ V'.

Definition 4 (Graph isomorphism) Given two labeled graphs G = (V, E, μ) and G' = (V', E', μ'), a graph isomorphism is a bijective function f: V → V' such that ∀v_{i}, v_{j} ∈ V, (v_{i}, v_{j}) ∈ E ↔ (f(v_{i}), f(v_{j})) ∈ E'.
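Definition 4 can be checked directly on small graphs by trying every bijection. The sketch below is a brute-force illustration of the definition only; it runs in factorial time and is not the polynomial-time labeling approach used in this paper. The function name and the example graphs are hypothetical, and the graphs are assumed to be simple and undirected.

```python
from itertools import permutations

def find_isomorphism(nodes_g, edges_g, nodes_h, edges_h):
    """Return a dict f with (u, v) in E <-> (f(u), f(v)) in E', or None."""
    nodes_g, nodes_h = list(nodes_g), list(nodes_h)
    eg = {frozenset(e) for e in edges_g}
    eh = {frozenset(e) for e in edges_h}
    if len(nodes_g) != len(nodes_h) or len(eg) != len(eh):
        return None
    for perm in permutations(nodes_h):
        f = dict(zip(nodes_g, perm))
        # With equal edge counts and an injective f, mapping every edge of G
        # onto an edge of G' is enough to make the equivalence hold.
        if all(frozenset((f[u], f[v])) in eh for u, v in edges_g):
            return f
    return None

# A path a-b-c and a path y-x-z are isomorphic; a triangle is not.
path_1 = (["a", "b", "c"], [("a", "b"), ("b", "c")])
path_2 = (["x", "y", "z"], [("x", "y"), ("x", "z")])
triangle = (["r", "s", "t"], [("r", "s"), ("s", "t"), ("r", "t")])

mapping = find_isomorphism(*path_1, *path_2)
no_mapping = find_isomorphism(*path_1, *triangle)
```

The middle node `b` must map to the middle node `x`, which is what the returned mapping shows. The triangle is rejected early because its edge count differs.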

Definition 5 (Frequent subgraph) Given a graph G = (V, E, μ), support(g) is the number of isomorphic embeddings of subgraph g in G. A subgraph is frequent if its support is no less than a given minimum support threshold.

Algorithms

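**Algorithm 1** frequentCommonDiscover(G, σ)

for each network G_{i} in G do

C_{i} ← Find node clusters with size no less than σ

end for

F^{0} ← Find node clusters that are present in all C_{0} ~ C_{k}

for each network G_{i} in G do

Remove from G_{i} the nodes whose clusters are not in F^{0}

end for

for each network G_{i} in G do

L_{i} ← Find edge groups with size no less than σ

end for

F^{1} ← Find edge groups that are present in all L_{0} ~ L_{k}

for each network G_{i} in G do

Remove from G_{i} the edges whose groups are not in F^{1}

end for

while F^{t-1} is not empty do

for each network G_{i} in G do

for each edge e_{j} in E do

Extend the subgraphs found at level t-1 with e_{j}

Add the pattern label of each extended subgraph to P_{i}, computed with patternLabel

end for

end for

F^{t} ← Find subgraph patterns that are present in all P_{0} ~ P_{k}

for each network G_{i} in G do

Remove from G_{i} the edges that do not occur in any pattern of F^{t}

end for

end while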

**Algorithm 2** patternLabel(E)

Author information

RS is a graduate student in CG's lab with training in computer science and this work is part of her dissertation research. NCWG is an Associate professor with training in biochemistry and molecular biology. CG (Associate professor) has an interdisciplinary background in molecular and computational biology. He has published a number of computational methods with a variety of applications in biomedical research, since 2001.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

RS carried out this work, developed the method, analyzed the results and drafted the manuscript. NG assisted in the functional analysis of the identified subgraphs and in manuscript preparation. CG conceived of the study, provided overall conceptual framework for this paper, analyzed the results and wrote part of the manuscript. All authors have read and approved the final manuscript.

Acknowledgements

This work was partly supported by NIH/NIGMS grants to CG [1R01GM086533 and 1R15GM080681]; and startup funds to CG from the University of Nebraska Medical Center.

This article has been published as part of