Discovery and analysis of consistent active sub-networks in cancers

Gaire, Raj K; Smith, Lorey; Humbert, Patrick; Bailey, James; Stuckey, Peter J; Haviv, Izhak

doi:10.1186/1471-2105-14-S2-S7

Volume 14 Supplement 2

Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Bioinformatics

Proceedings
Open access
Published: 21 January 2013

Discovery and analysis of consistent active sub-networks in cancers

Raj K Gaire^1,2,
Lorey Smith³,
Patrick Humbert³,
James Bailey¹,
Peter J Stuckey¹ &
…
Izhak Haviv^4,5

BMC Bioinformatics volume 14, Article number: S7 (2013) Cite this article

4159 Accesses
10 Citations
Metrics details

Abstract

Gene expression profiles can show significant changes when genetically diseased cells are compared with non-diseased cells. Biological networks are often used to identify active subnetworks (ASNs) of the diseases from the expression profiles to understand the reason behind the observed changes. Current methodologies for discovering ASNs mostly use undirected PPI networks and node centric approaches. This can limit their ability to find the meaningful ASNs when using integrated networks having comprehensive information than the traditional protein-protein interaction networks. Using appropriate scoring functions to assess both genes and their interactions may allow the discovery of better ASNs.

In this paper, we present CASNet, which aims to identify better ASNs using (i) integrated interaction networks (mixed graphs), (ii) directions of regulations of genes, and (iii) combined node and edge scores. We simplify and extend previous methodologies to incorporate edge evaluations and lessen their sensitivity to significance thresholds. We formulate our objective functions using mixed integer programming (MIP) and show that optimal solutions may be obtained.

We compare the ASNs obtained by CASNet and similar other approaches to show that CASNet can often discover more meaningful and stable regulatory ASNs. Our analysis of a breast cancer dataset finds that the positive feedback loops across 7 genes, AR, ESR 1, MYC, E 2F 2, PGR, BCL 2 and CCND 1 are conserved across the basal/triple negative subtypes in multiple datasets that could potentially explain the aggressive nature of this cancer subtype. Furthermore, comparison of the basal subtype of breast cancer and the mesenchymal subtype of glioblastoma ASNs shows that an ASN in the vicinity of IL 6 is conserved across the two subtypes. This result suggests that subtypes of different cancers can show molecular similarities indicating that the therapeutic approaches in different types of cancers may be shared.

Background

The full genome sequencing of cancer cases demonstrates how remarkably heterogeneous cancer cases are [1]. This heterogeneity is consistent with the hypothesis that most mutations are innocent bystander consequences of the failure of cancer cells' intrinsic mechanism to repair and guard the integrity of the genome [2]. However, the observed heterogeneity of the cancer mutations combined with the knowledge of multiple lesions that all could lead to the same phenotypic consequence [3], leads to a new emerging hypothesis. According to this competing hypothesis, intrinsic subtype specific cancer causing mutations are rare, but their biological output is common [4].

The recognition that cancer stem cells within a tumour mass uniquely carry the potential for overt malignancy [5, 6] and the discovery that these cells can be transformed into and change forms between epithelial or mesenchymal cells, a phenomena known as epithelial-mesenchymal transformation (EMT) [7], has increased our insight into the link between EMT and fatal cancer phenotypes, such as metastasis and resistance to treatments. In addition, the discovery of intrinsic subtypes of breast cancer that express unique groups of genes [8] has advanced its prognosis. Some intrinsic subtypes of breast cancer are associated with elevated susceptibility to specific drugs, such as Herceptin (for amplified HER 2 cases) and Tamoxifen (for ER+ cases), while other subtypes, such as the mesenchymal basal/triple negative cases remain without a matching therapeutic strategy. Being able to compare subtypes of different cancers may help identify genes causing the specific subtypes of cancers, leading to identify better therapeutic targets. More importantly, this could provide a scientific basis to sharing therapeutic strategies in subtypes of different cancers.

The task of interpreting gene expression profiles in a disease is not only to differentiate the non-random changes from the random and irrelevant changes, but also to identify the disease causing changes, their downstream effects and the cellular responses related to the disease. If such a procedure worked, one would expect to see the intrinsic signature of luminal breast cancer subtype emerging as downstream to gene ER. Furthermore, one could identify driver mutations of the mesenchymal/basal subtype for which therapeutic strategies fail to work. Biological interaction networks contain immense amount of knowledge suitable for such analysis [9]. Finding active subnetworks in diseases is a typical analysis which uses such networks to generate meaningful biological contexts from the differentially expressed genes.

An active subnetwork (ASN) is a subnetwork of a biological interaction network in which the significant nodes obtained from an experiment are connected by edges defined in the network [10]. A methodology for finding ASN was initially proposed by Ideker et al. [10]. When coining the idea of ASNs, they established the goal of finding ASNs that could answer questions such as "What are the signalling and regulatory interactions in control of the observed gene expression changes? How is this control exerted?" To achieve this, several variations of their work have been proposed [11–17] to analyse differentially expressed genes in diseases using protein-protein interaction (PPI) networks.

PPI networks are undirected networks. Therefore, the ASNs obtained by using these networks can show signalling and regulatory information, but without the directionality of edges, they cannot explicitly show how the controls are exerted. Furthermore, PPI networks have two problems. One, individual interaction networks exhibit little overlap [18], suggesting that a single interaction network might not contain complete information. To overcome this problem, different interaction networks are combined in single integrated interaction network as a mixed graph, containing different types of biological interactions, such as activation, inhibition and post-translational modification. Second, these networks contain both false positive and false negative edges [19], suggesting that the qualities of edges may not be consistent across the network. This problem is solved by computing confidence values of the edges from the sources from which the edge information is obtained from (e.g. number of sources). Alternatively, the values are defined by co-relations of genes in the experimental data [16].

With the integrated networks containing more comprehensive information, one would expect to obtain more informative ASNs by using them with the existing methods. In fact, Deshpande et al. [17] used the direction of regulation of genes in multiple species and showed that the identified ASNs are more stable and consistent across multiple species. Similarly, other tools such as IPA (Ingenuity^® Systems, http://www.ingenuity.com) show directed edges in their outputs. However, these methods do not use node and edge information together which could produce better ASNs from this type of networks.

We have identified the three issues: node centricity, sensitivity to p-value thresholds and inability to compare ASNs, which limit the existing methods from finding ASNs that could explain how the genetic controls are exerted in diseases (see supp text for details). In addition, the existing ASN finding tools require users to have a copy of the entire network database prior to starting any analysis, which can further affect their usability.

In this paper, we present CASNet (C onsistent A ctive S ubnet works), our novel methodology that uses (i) an integrated interaction network, STRING [20], (ii) directions of gene regulations, and (iii) combined node and edge evaluations, to find better ASNs. We model the objective function using a mixed integer programming (MIP) model and then solve the model using CPLEX to efficiently discover optimal ASNs. Furthermore, CASNet uses web based APIs to access only relevant parts of the interaction networks and does not require a local copy of the entire database obtained prior to using this tool. We use simulated datasets to show that CASNet can address the above identified limitations. Additionally, we use publicly available datasets to identify and compare ASNs of cancers that provides interesting biological insights. CASNet and supp text of this paper are available at http://www.csse.unimelb.edu.au/~rgaire/CASNet/.

Results and discussion

In this section, firstly we will show that by adding edge information, our method not only selects interaction edges which are consistent with the experimental results, but also helps reduce the sensitivity to p-value significance threshold. Secondly, we will present our analysis of breast cancer and comparison of ASNs in mesenchymal subtypes of breast and glioblastoma cancers.

Comparing with node centric approaches

In order to evaluate CASNet, we used simulated networks of varying nodes and compared our results directly with the result obtained from three different methods (i) jActiveModule [10], (ii) heinz [15] and (iii) CEZANNE [16]. jActiveModule is a Cytoscape [21] implementation of Ideker et al. ASN finding problems being NP hard, it uses the simulated annealing heuristic approach to obtain optimal solutions. In contrast, heinz is an implementation of Dittrich et al. which models this problem as a Prize Collecting Steiner Tree problem and uses Integer Programming techniques to obtain the exact solution. Heinz is also available as a Bioconductor package, which was used for this comparison. Finally, CEZANNE is MATISSE [22] module and uses not only the significance of nodes but also the similarity of nodes as well as an assessment of edges based on the correlations of nodes. These three methods cover a broad range of techniques that are currently available for finding ASNs. By comparing our results with the results obtained from these methods, we can understand the performance of our method against other similar methods. In addition to this, we also compared the results obtained by using our method disregarding the directionality of edges to understand the performance differences attributed to the directionality. These experiments were performed for different network sizes to assess the robustness of our method. At this point, we note that some approaches have been developed that use co-expression [23], co-variance [14] and correlation [16] of genes for creating or assessing edges in the networks. Although useful, these methods may not be applicable when only a list of genes is available.

We used the following parameters of the individual methods: jActiveModule was used with the default parameters to obtain only 1 module. For heinz, a false discovery rate (FDR) of 0.05 was used (more stringent FDR thresholds had poor performance). For two nodes a and b, CEZANNE required node similarity scores matrix, sim[a][b]. This was calculated from p-values of the nodes, p_a and p_a, as sim[a][b] = sim[b][a] = min(p_a, p_b)/max(p_a, p_b) such that the nodes having similar p-values obtain high similarity scores (≈1) while the nodes having different p-values obtain low similarity scores (≈0). A p-value threshold of 0.05 was used for both CEZANNE and CASNet. Precisions (true positive/all classified as positive) and recalls (true positive/all positives) for both nodes and edges were added for each methods for comparisons.

Figure 1 illustrates the performance of different methods. It shows that jActiveModule finds modules with low precision and high recall. Therefore, the modules obtained by this method are often bigger and may contain large number of false positives. In contrast, heinz produces modules with high precision but low recall, hence discards many true positive nodes and edges. CEZANNE produced ASNs with better recall, but worse precision than heinz. Note that CEZANNE and jActiveModule failed to find modules in small and large networks respectively. Additionally, jActiveModule produced inconsistent precision and recall whereas CEZANNE produced consistent precision but inconsistent recall for different network sizes. In contrast, heinz performed consistently, though with lower recall rates.

Since CASNet relaxes the significance threshold, it starts by selecting a large number of nodes. The edges connecting these nodes are then assessed based on their directionality and confidence scores. This is also apparent in Figure 1. Here, CASNet without considering directionality of edges (CASNet_NCA) performs with high precision but low recalls. The recall in CASNet is highly improved by using edge information. This demonstrates that by using additional information of edges, the quality of ASNs can be dramatically improved (see Supp text for further comparisons).

Analysis of cancer datasets

In a recent study, Iliopoulos et al. [24] noted that the genes related to inflammatory pathways form stable PFLs regulated by IL 6 causing continuous progression of cancer. Their study involved extensive laboratory based works. By using computational methods instead, we can potentially not only reduce costs and efforts of identifying PFLs, but also identify novel PFLs.

ASNs and PFLs in breast cancer

We used a publicly available breast cancer dataset to explore ASNs and PFLs in the basal subtype of breast cancer. SAM analysis [25] was performed to obtain a list of significant genes. This list was used with STRING network [20] to find ASNs and PFLs.

Figure 2 is the ASN obtained from the gene list of basal breast cancer. This ASN contains PFLs as shown in Figure 3. Besides other genes, we found that the PFLs across MYC, E 2F 1, AR, ESR 1, CCND 1, PGR and BCL 2 were conserved across independent breast cancer dataset as shown in Figure 4. However, we did not find any PFL when the differentially expressed genes from the entire dataset without discriminating the cancer subtypes were used. ESR (i.e. ER) is one of the genes which differentiates luminal and basal subtypes [8]. The breast cancer patients with low ER expression levels have poorly survival rates. Our identification of these PFL in ER- samples with oncogene like MYC and tumour suppressor gene like CCND 1 with the breast cancer discriminating gene ESR 1 is a novel finding. Existence of such PFLs probably explains the reasons behind the resistance to the therapeutic in this cancer subtype.

In order to obtain independent evidence for the CASNet choice of expression-based nodes and edges, we used the TCGA query portal [26] to find genes included in the ASN which had consistent expression and copy number changes. Here, we assumed that a gene could be causal if it is not affected by other genes and have mRNA expression level changes consistent with the copy numbers of the genes. FOXA 1, NCOA 7 and DOCK 7 were the top three genes in this result. Since DOCK 7 is a downstream signalling intermediate of ERBB 2, identification of DOCK 7 aberrations in the absence of HER 2 over-expression may implicate DOCK 7 in Transtuzumab drug resistance. Moreover, when FOXA 1 and NOCA 7 were considered with the PFL forming genes, all the samples had at least one gene that had consistently changed expression levels. This further suggests that PFLs and their neighbouring genes could be important in understanding the nature of complex cancer cases.

ASNs in glioblastoma vs breast cancer

Here, we used Glioblastoma (GBM) dataset from Verhaak et al. [27] to obtain an ASN for mesenchymal subtype cases and compare it with the ER- breast cancer ASN, since both of these subtypes of cancers show mesenchymal signatures, have poor patient survival and are likely to be related to the EMT event.

Figure 5 shows the similar components of ASN in these subtypes of cancers. Here, PPARG, SREBF 2, E 2F 1, MYBL 2 and MYCN are the genes which are differently regulated while several other genes are similarly regulated in the two cancer subtypes. PPARG is up-regulated in mesenchymal GBM, but down-regulated in ER- breast cancer. This gene can both enhance and suppress the expression of PTGS 2 and ICAM 1, which are up-regulated in both subtypes. PTGS 2 (COX 2, an aspirin target) is a key mediator of inflammation [28]. This behaviour of PPARG is likely expressed by stromal adipocytes (fat cells), which are known to accentuate inflammatory processes in a number of human pathologies, including cancer.

IL 6 is the most connected node in this similarity network. It is associated with acute inflammation, suggesting that higher expression of IL 6 could be a cellular response to inflammation. MYC, which is found to be forming PFLs in basal/triple negative breast cancer cases, is seen to be regulated by IL 6, IL 1B, IFNG and HIF 1A, and is conserved in both mesenchymal GBM and ER- breast cancer. This independently confirming the role of IL6 in cancers and suggests that maintaining high level of inflammation may be a conserved feature of mesenchymal subtypes of cancers. More generally, our finding of the common pathways in the different subtypes of cancers suggests that even though the genetic signatures among different cancers may not be similar, the cancer subtypes might not only have similarities in their genetic signatures but also have similarly affected pathways and could potentially be treated in the same manners.

Materials and methods

In this section, we describe the datasets and our methodology for finding ASNs.

Simulated networks

Here, we created simulated experimental datasets and consistent gold standard ASNs. These ASNs were combined to obtain simulated network. The experimental datasets and the simulated networks were then used to obtain ASNs from different methods and compared against the gold standard ASNs. More specifically, we fist created nodes of sizes N = 100; 1000; 10000. p-values were then assigned to these nodes such that n = 0:1 x N randomly selected nodes obtained values smaller than 0:001 (considered as the significant nodes), while other nodes were assigned uniformly distributed values between 0 and 1. 2 x n random pairs of significant nodes were assigned directed up- and down-regulating edges consistent with the direction of regulation of nodes. The confidence scores of all the edges were assigned a constant value, 1. Each of the above experiment with different node sizes was repeated for 10; 15 and 20 times by randomly reassigning p-values but changing the directions of regulations of a fraction (0:5) of significant nodes to add randomness in the experimental results. Since the edges are reassigned in each experimental data, this variation in node regulation can create different edges among the same two nodes as found in real biological networks.

Biological network data

We used the action networks of STRING [20] as our network data source. The network was access at run time via the publicly available web based APIs. The benefits of this approach are that downloading and maintaining the entire database is not required, making CASNet usable in a computer with internet connection. The assessed parts of STRING were saved locally for future use, thereby avoiding excessive internet usage.

Experiment data

The breast cancer samples GSE2034 [29, 30] were obtained from obtained from GEO [31]. The GEO dataset contained 282 samples, with 206 were ER+ and 75 ER- cases. The TCGA BRCA dataset contained 465 samples. TCGA [32] (BRCA) dataset was used as an independent validation dataset.

The GBM dataset was obtained from a TCGA publication [27] containing 206 samples. It categorised the samples into 4 subtypes: Proneural (PN), Neural (NL), Classical (CL) and Mesenchymal (MES) based on their genetic signatures. Their SAM analysis [25] results of MES subtype were taken as a basis for the significance measurement of the genes (see supp text for details).

Interaction networks

Let V_exp be a set of biological molecules being investigated in an experiment. The differential expression analysis finds molecules M = (V_exp, P, D) with p-value significances P, and directions of regulations (up- or down-regulation) D.

A network (or graph) G = (V, E) consists of vertexes (or nodes) V which are connected by edges E. Given two nodes v₁, v₂ ∈ V, and edge e ∈ E; e = (v₁, v₂, t, c) connects the nodes v₁ and v₂ by an edge with a type t and a confidence value c. The confidence value of an edge is in the range of (0, 1] where 0 is the least confidence and 1 is the most confidence value. Two nodes may be connected by multiple edges with different values of t. The following four types of edges (t) are defined between nodes: (i) physical binding of two nodes, denoted by v₁ - v₂, (ii) catalytic and post-transcription modification of a node by another node, denoted by v₁ ⊸ v₂, (iii) activation of a node by another node, denoted by v₁ → v₂, and (iv) inhibition of a node by another node, denoted by v₁ ⊣ v₂. For an edge e, a confidence value c is defined. Now, the problem of finding an ASN can be stated as follows:

Problem statement

Given an interaction network, G = (V, E), and a differentially expressed (DE) gene dataset, M = (V_exp, P, D), find the best (where the notion of best is defined by scoring functions) subnetwork G' = (V', E') ⊂ G which connects the highly DE genes v ∈ V', V_exp ⊂ V with edges E' ⊂ E that are consistent with the data.

Consistent edge

An edge e = (n₁, n₂, t, c) ∈ E between two nodes n₁, n₂ ∈ V is considered to be consistent if the direction of regulation of gene n₂ can be explained by the direction of regulation of n₁ by using the edge type t with a high confidence c.

For example, if n₁ promotes n₂, i.e. n₁ → n₂, and both n₁ and n₂ are up- or down-regulated, then the edge e = (n₁, n₂, →, c) is considered to be consistent with the data with a confidence c. In contrast, if a node n₁ is up-regulated and the other node n₂ is down-regulated and vice-versa, then the edge is considered to be inconsistent with the data.

Based on the problem statement, we now define the node, edge and subnetwork scores that will be used to evaluate ASNs.

Node scores

In an experiment, DE genes are often associated with p-values obtained from some statistical tests. A subnetwork with nodes having low p-values is a desirable feature of an ASN [10]. Additionally, it is often desirable to exclude highly connected nodes in an ASN [33–35]. Here we adopt Dittrich et al's [15] node scoring method to include low p-values in the networks (profit). At the same time, we use a simple approach to penalise highly connected nodes (cost).

Let p_n be the p-value of a node n. Then the node's profit score W_n is given as [15],

W_{n} = (a - 1) \times (l o g (p_{n}) - l o g (τ))

(1)

where a = (0, 1] is a shape parameter of the beta distribution fitted for a dataset representing a signal to noise ratio, and τ is a threshold to control the size of the ASN (interpreted as the false discovery rate). In this case, the value of (a - 1) acts as a scaling factor. This factor is useful to compare the scores of a node obtained from different datasets having different p-value distributions. On the other hand, τ can be a selected threshold which can make a node score positive or negative, and thereby controls the size of the ASNs. Since the value of a does not affect the sign of the scores, and if a single p-value dataset is used, a can be assigned a constant value without affecting the resulting ASN. Practically, a is close to 0. Therefore, the node score can be simplified by assigning a = 0 as W_n = -1 × (log(p_n) - log(τ)). Furthermore, if a set of significant genes is used in an experiment after applying a p-value threshold, the Eq. 1 can be further reduced to, W_n = 1 × log(p_n) for all the genes in the set and W_n = -1 × |Constant| for the genes not in the list.

If D is the degree of a node n (i.e. it is connected to D other nodes), we assign a cost C_n to the node to penalise highly connected nodes, as, C_n = log(D).

Since the profit and the cost scores of a node do not have the same scale, we scale these values to a range of [-1, 1] to obtain standard scores as, $W_{n}^{'} = \frac{W_{n}}{m a x_{m \in V} (| W_{m} |)}; C_{n}^{'} = \frac{C_{n}}{m a x_{m \in V} (| C_{m} |)}$

Edge scores

It is desirable to assign a high positive score to an edge which is consistent with the data, so that including such edges will increase the subnetwork scores. Similarly, the inconsistent edges should be penalised by assigning negative scores to them. Additionally, the confidence value of an edge should be used to scale these scores. Therefore, we use the following scheme to obtain the edge scores:

1.
If n₁ promotes n₂, then the consistency score W_e is equal to (i) 2 if both n₁ and n₂ are changed in the same direction, (ii) -1 if either n₁ or n₂ is unchanged and (iii) -2, if n₁ and n₂ are changed in opposite direction.
2.
Similarly, if n₁ suppresses n₂, then the consistency score W_e is equal to (i) 2 if n₁ and n₂ are changed in an opposite direction, (ii) -1 if only one of n₁ and n₂ is changed and (iii) -2, if n₁ and n₂ are changed in the same direction.
3.
For edges n₁ ⊸ n₂ and n₁ - n₂, a constant value can be assigned depending on whether including such edges is desirable or not. For example, W_e = 1 could be used in PPI networks, while W_e = -1 could be used in the networks where having these edges is not desirable. By default, we use W_e = -1 to lessen emphasis on undirected edges.

Now, an edge score is defined as, S_e = W_e × C_e where C_e = (0, 1] is the confidence score of the edge obtained from the interaction network.

Finally, the standard edge score is obtained as, $S_{e}^{'} = \frac{S_{e}}{m a x_{f \in E} | S_{f} |}$

Sub-network score

Based on the standard node and edge scores, the score of a subnetwork G'-is obtained by-using the linear equation,

S = \sum_{n \in V} x_{n} \times (α \times {W^{'}}_{n} - β_{n} \times C_{n}^{'}) + γ \times \sum_{e \in E} x_{e} \times {S^{'}}_{e}

(2)

where α, β and γ are the scaling factors of the node weight profits, the node connectivity costs and the edge consistency scores respectively and x_n and x_e are boolean variables with values 1 if n ∈ V', e ∈ E' and 0 otherwise. The values of these scaling factors could be obtained by using gold standard network and experiment datasets. In the absence of such datasets, we use α = γ = 1 and β = 0 for the positive scoring nodes and β = 1 for negative scoring nodes in our experiments. This is because a large number of edges around cancer related genes exist in the biological networks, since a high number experiments have been performed in those genes. Penalising them at the same rate as others eliminates the highly DE genes (results not shown).

The objective function for finding ASNs is to obtain a sub-network which maximises the subnetwork score S.

The MIP model

Here, we model the problem of finding an ASN by using the mixed integer linear programming (MIP) model in CPLEX which maximises the objective function in Eq. 2. x_n and x_e are defined as boolean variables (i.e. x ∈ 0, 1). Further to this, the following additional constraints are imposed: (a) x_n(i)→ ∃x_{e(i, j)}i.e. x_n(i)≤ Σ_jx_{e(i, j)}. (b) x_{e(i, j)}→ x_n(i); i.e. x_{e(i, j)}≤ x_n(i). (c) x_{e(i, j)}→ x_n(i); i.e. x_{e(i, j)}≤ x_n(i). where n(i) is the i^th node and e(i, j) is an edge connecting the nodes n(i) and n(j).

Conclusion

A large number of datasets that are currently being produced, such as TCGA and ICGC, include definitive genome wide mutational status of many samples, making the task of interpreting the results, and identifying common features even more formidable. Adapting to these high dimensional datasets, biological interaction databases are being integrated into single databases to provide more comprehensive information. Since all the interactions among the nodes might not be active at the same point of time or environmental conditions, the current methodologies of enrichment assessment for candidate networks or pathways can fall short in their ability to discriminate between real biological inferences from false positive ones. Better methodologies are required to use these networks and systematically find interesting and meaningful interactions in disease conditions.

In this paper, we presented a methodology to analyse gene expression results of diseases with STRING network, to not only find a connected subset of nodes that are observed to be highly differentially expressed, but also use the edges in the network to generate hypotheses regarding the reason behind the observed changes. Our methodology, CASNet, enhances existing methodologies by introducing edge scores to solve node centricity, p-value sensitivity problems and subnetwork comparability problems.

We demonstrated that complicated regulatory ASNs and PFLs exist in low surviving cancer cases such as the mesenchymal subtype of GBM and the basal/triple negative subtype of breast cancer. Finally, we showed that by comparing ASNs of different disease types, molecular similarities of the disease can be identified that can be useful in their treatments. In this way, CASNet has widened the possibilities of network analysis in generating biologically significant hypotheses and directing the future researches.

Limitations

Firstly, the current state of directed interaction network has low coverage. Additionally, a large number of edges around the genes of well studied diseases, including cancer, exist due to the large number of experiments that have been carried out with those genes. This creates a bias toward some parts of networks. The parts of network where directionality information exists are most likely the pathways that are already well-understood. As such, the discovery of new genes and their interactions from these networks may be less likely. Secondly, we found several instances where the edges are incorrectly defined in STRING. The edges in our ASNs are only as good as the curation and literature mining methodologies used in creating the networks. Finally, the interaction networks do not differentiate wild type/mutant and active/inactive molecules in their nodes. In the absence of this level of sensitivity in the existing biological networks, precise conclusions cannot be made. Consequently, the results obtained from ASN finding methods are susceptible to the problems associated with the underlying networks and the datasets being used, and hence require independent validations.

Declaration

The publication costs for this article were funded by the University of Melbourne.

This article has been published as part of BMC Bioinformatics Volume 14 Supplement 2, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S2.

References

McDermott U, Downing JR, Stratton MR: Genomics and the Continuum of Cancer Care. N Engl J Med. 2011, 364 (January 27): 340-50.
CAS PubMed Google Scholar
Lai Y, Eckenrode SE, She J: A statistical framework for integrating two microarray data sets in differencial expression analysis. BMC Bioinformatics. 2009, 10 (Suppl 1): S23-10.1186/1471-2105-10-S1-S23.
Article PubMed Central PubMed Google Scholar
Wood LD, Parsons DW, Jones S, Lin J, Sjoblom T, Leary RJ, Shen D, Boca SM, Barber T, Ptak J, Silliman N, Szabo S, Dezso Z, Ustyanksky V, Nikolskaya T, Nikolsky Y, Karchin R, Wilson PA, Kaminker JS, Zhang Z, Croshaw R, Willis J, Dawson D, Shipitsin M, Willson JKV, Sukumar S, Polyak K, Park BH, Pethiyagoda CL, Pant PVK, Ballinger DG, Sparks AB, Hartigan J, Smith DR, Suh E, Papadopoulos N, Buckhaults P, Markowitz SD, Parmigiani G, Kinzler KW, Velculescu VE, Vogelstein B: The Genomic Landscapes of Human Breast and Colorectal Cancers. Science. 2007, 318 (5853): 1108-13. 10.1126/science.1145720.
Article CAS PubMed Google Scholar
Bignell GR, Greenman CD, Davies H, Butler AP, Edkins S, Andrews JM, Buck G, Chen L, Beare D, Latimer C, Widaa S, Hinton J, Fahey C, Fu B, Swamy S, Dalgliesh GL, Teh BT, Deloukas P, Yang F, Campbell PJ, Futreal PA, Stratton MR: Signatures of mutation and selection in the cancer genome. Nature. 2010, 463 (18 February): 893-8.
Article PubMed Central CAS PubMed Google Scholar
Jordan CT, Guzman ML, Noble M: Cancer Stem Cells. N Engl J Med. 2006, 355: 1253-61. 10.1056/NEJMra061808.
Article CAS PubMed Google Scholar
Reya T, Morrison SJ, Clarke MF, Weissman IL: Stem cells, cancer, and cancer stem cells. Nature. 2001, 414: 105-11. 10.1038/35102167.
Article CAS PubMed Google Scholar
Thiery JP, Acloque H, Huang RY, Nieto MA: Epithelial-Mesenchymal Transitions in Development and Disease. Cell. 2009, 139 (5): 871-90. 10.1016/j.cell.2009.11.007.
Article CAS PubMed Google Scholar
Perou CM, Jeffrey SS, van de Rijn M, Rees CA, Eisen MB, Ross DT, Pergamenschikov A, Williams CF, Zhu SX, Lee JCF, Lashkari D, Shalon D, Brown PO, Botstein D: Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. PNAS. 1999, 96 (16): 9212-17. 10.1073/pnas.96.16.9212.
Article PubMed Central CAS PubMed Google Scholar
Ideker T, Sharan R: Protein networks in disease. Genome Res. 2008, 18 (4): 644-52. 10.1101/gr.071852.107.
Article PubMed Central CAS PubMed Google Scholar
Ideker T, Ozier O, Schwikowski B, Siegel AF: Discovering regulatory and signaling circuits in molecular interaction networks. Bioinformatics. 2002, 18: S233-40. 10.1093/bioinformatics/18.suppl_1.S233.
Article PubMed Google Scholar
Sohler F, Hanisch D, Zimmer R: New methods for joint analysis of biological networks and expression data. Bioinformatics. 2004, 20: 1517-21. 10.1093/bioinformatics/bth112.
Article CAS PubMed Google Scholar
Rajagopalan D, Agarwal P: Inferring pathways from gene lists using a literature-derived network of biological relationships. Bioinformatics. 2005, 21 (6): 788-793. 10.1093/bioinformatics/bti069.
Article CAS PubMed Google Scholar
Scott MS, Perkins T, Bunnell S, Pepin F, Thomas DY, Hallett M: Identifying Regulatory Subnetworks for a Set of Genes. Molecular & Cellular Proteomics. 2005, 4: 683-692. 10.1074/mcp.M400110-MCP200.
Article CAS Google Scholar
Guo Z, Li Y, Gong X, Yao C, Ma W, Wang D, Li Y, Zhu J, Zhang M, Yang D, Wang J: Edge-based scoring and searching method for identifying condition-responsive protein-protein interaction sub-network. Bioinformatics. 2007, 23 (16): 2121-2128. 10.1093/bioinformatics/btm294.
Article CAS PubMed Google Scholar
Dittrich MT, Klau GW, Rosenwald A, Dandekar T, Müller T: Identifying functional modules in protein-protein interaction networks: an integrated exact approach. Bioinformatics. 2008, 24 (ISMB 2008): i223-i231.
Article PubMed Central CAS PubMed Google Scholar
Ulitsky I, Shamir R: Identifying functional modules using expression profiles and confidence-scored protein interactions. Bioinformatics. 2009, 25 (9): 1158-64. 10.1093/bioinformatics/btp118.
Article CAS PubMed Google Scholar
Deshpande R, Sharma S, Verfaillie CM, Hu WS, Myers CL: A scalable approach for discovering conserved active subnetworks across species. PLoS Computational Biology. 2010, 6 (12): e1001028-10.1371/journal.pcbi.1001028.
Article PubMed Central PubMed Google Scholar
Prieto C, Rivas JDL: APID: Agile Protein Interaction DataAnalyzer. Nucleic Acids Research. 2006, 34 (Web Server): W298-W302. 10.1093/nar/gkl128.
Article PubMed Central CAS PubMed Google Scholar
Finley R: A Guide to Yeast Two-Hybrid Experiments. 2007, Cambridge, MA: Cell Press, 24: 17-21. chap. 5
Google Scholar
Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A, Simonovic M, Bork P, von Mering C: STRING 8-a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res. 2009, 37: D412-D416. 10.1093/nar/gkn760.
Article PubMed Central CAS PubMed Google Scholar
Shannon P, Markiel A, Ozier O, Baliga N, Wang J, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research. 2003, 13 (11): 2498-504. 10.1101/gr.1239303.
Article PubMed Central CAS PubMed Google Scholar
Ulitsky I, Shamir R: Identification of functional modules using network topology and high-throughput data. BMC Syst Biol. 2007, 1: 8-10.1186/1752-0509-1-8. open access
Article PubMed Central PubMed Google Scholar
Zhang B, Horvath S: A General Framework for Weighted Gene Co-Expression Network Analysis. Stat Appl Genet Mol Biol. 2005, 4: Article17-
PubMed Google Scholar
Iliopoulos D, Hirsch HA, Struhl K: An Epigenetic Switch Involving NF-kB, Lin28, Let-7 MicroRNA, and IL6 Links Inammation to Cell Transformation. Cell. 2009, 139: 693-706. 10.1016/j.cell.2009.10.014.
Article PubMed Central CAS PubMed Google Scholar
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to ionizing radiation response. PNAS. 2001, 98: 5116-21. 10.1073/pnas.091062498.
Article PubMed Central CAS PubMed Google Scholar
The Cancer Genome Atlas Group: The Cancer Genome Atlas data browser. 2011, [http://tcga-portal.nci.nih.gov/tcga-portal/AnomalySearch.jsp]
Google Scholar
Verhaak RG, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, Alexe G, Lawrence M, O'Kelly M, Tamayo P, Weir BA, Gabriel S, Winckler W, Gupta S, Jakkula L, Feiler HS, Hodgson JG, James CD, Sarkaria JN, Brennan C, Kahn A, Spellman PT, Wilson RK, Speed TP, Gray JW, Meyerson M, Getz G, Perou CM, Hayes DN: Integrated Genomic Analysis Identifies Clinically Relevant Subtypes of Glioblastoma Characterized by Abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell. 2010, 17: 98-110. 10.1016/j.ccr.2009.12.020.
Article PubMed Central CAS PubMed Google Scholar
The UniProt Consortium: The Universal Protein Resource. Nucl Acids Res. 2010, 38 (4): D142-8.
Article PubMed Central Google Scholar
Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, van Gelder MEM, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005, 365 (9460): 671-9.
Article CAS PubMed Google Scholar
Carroll JS, Meyer CA, Song J, Li W, Geistlinger TR, Eeckhoute J, Brodsky AS, Keeton EK, Fertuck KC, Hall GF, Wang Q, Bekiranov S, Sementchenko V, Fox EA, Silver PA, Gingeras TR, Liu XS, Brown M: Genome-wide analysis of estrogen receptor binding sites. Nat Genet. 2006, 38 (11): 1289-97. 10.1038/ng1901.
Article CAS PubMed Google Scholar
Edgar R, Domrachev M, Lash A: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research. 2002, 30: 207-10. 10.1093/nar/30.1.207.
Article PubMed Central CAS PubMed Google Scholar
The Cancer Genome Atlas Group: The Cancer Genome Atlas website. 2011, [http://tcga-data.nci.nih.gov/tcga/]
Google Scholar
Croesa D, Couche F, Wodak SJ, van Helden J: Inferring Meaningful Pathways in Weighted Metabolic Networks. J Mol Biol. 2006, 356: 222-236. 10.1016/j.jmb.2005.09.079.
Article Google Scholar
Faust K, Dupont P, Callut J, van Helden J: Pathway discovery in metabolic networks by subgraph extraction. Bioinformatics. 2010, 26 (9): 1211-1218. 10.1093/bioinformatics/btq105.
Article PubMed Central CAS PubMed Google Scholar
Dezső Z, Nikolsky Y, Nikolskaya T, Miller J, Cherba D, Webb C, Bugrim A: Identifying disease-specific genes based on their topological significance in protein networks. BMC Syst Biol. 2009, 3: 36-10.1186/1752-0509-3-36.
Article PubMed Central PubMed Google Scholar

Download references

Acknowledgements

This work is supported by National ICT Australia (NICTA), which is funded by the Australian Government's Backing Australia's Ability initiatives, in part through the Australian Research Council (ARC), Komen for the cure, National Health and Medical Research Council (NHMRC) Cancer Australia, Nation Breast Cancer Foundation (NBCF) and Cancer Council Victoria (CCV).

Author information

Authors and Affiliations

NICTA, Victoria Laboratory and Department of Computing and Information Systems, University of Melbourne, Parkville, Vic, 3010, Australia
Raj K Gaire, James Bailey & Peter J Stuckey
Metabolomics, Population Studies and Profiling, Baker IDI Heart and Diabetes Institute, Melbourne, Vic, 3004, Australia
Raj K Gaire
Cell Cycle & Cancer Genetics, Peter MacCallum Cancer Centre, Melbourne, Vic, 3002, Australia
Lorey Smith & Patrick Humbert
Department of Pathology, School of Medicine, University of Melbourne, Parkville, Vic, 3010, Australia
Izhak Haviv
Faculty of Medicine in Galilee, Bar Ilan University, Israel
Izhak Haviv

Authors

Raj K Gaire
View author publications
You can also search for this author in PubMed Google Scholar
Lorey Smith
View author publications
You can also search for this author in PubMed Google Scholar
Patrick Humbert
View author publications
You can also search for this author in PubMed Google Scholar
James Bailey
View author publications
You can also search for this author in PubMed Google Scholar
Peter J Stuckey
View author publications
You can also search for this author in PubMed Google Scholar
Izhak Haviv
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Raj K Gaire.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

RKG, JB, PJS, IH - conception, design, development, data acquisition, analysis, interpretation of results, draft manuscript; LS, PH - conception, preliminary interpretation of results.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Gaire, R.K., Smith, L., Humbert, P. et al. Discovery and analysis of consistent active sub-networks in cancers. BMC Bioinformatics 14 (Suppl 2), S7 (2013). https://doi.org/10.1186/1471-2105-14-S2-S7

Download citation

Published: 21 January 2013
DOI: https://doi.org/10.1186/1471-2105-14-S2-S7

Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Bioinformatics

Discovery and analysis of consistent active sub-networks in cancers

Abstract

Background