This article is part of the supplement: Selected articles from the Twelfth Asia Pacific Bioinformatics Conference (APBC 2014): Systems Biology

Open Access Proceedings

Prioritizing protein complexes implicated in human diseases by network optimization

Yong Chen1,2,3, Thibault Jacquemin2, Shuyan Zhang3 and Rui Jiang2*

Author Affiliations

1 School of Information Science and Engineering, University of Jinan, Jinan 250014, China

2 MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, China

3 Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China

BMC Systems Biology 2014, 8(Suppl 1):S2  doi:10.1186/1752-0509-8-S1-S2


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1752-0509/8/S1/S2


Published: 24 January 2014

© 2014 Chen et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract

Background

The detection of associations between protein complexes and human inherited diseases is of great importance for understanding disease mechanisms. The dysfunction of a protein complex usually arises from the disruption of its member proteins and can consequently result in disease. Although individual disease proteins have been widely predicted, computational methods for systematically investigating disease-related protein complexes are still lacking.

Results

We propose a method, MAXCOM, for the prioritization of candidate protein complexes. MAXCOM uses a maximum information flow algorithm to optimize the relationship between a query disease and each candidate protein complex through a heterogeneous network constructed by combining protein-protein interactions and disease phenotypic similarities. Cross-validation experiments on 539 protein complexes show that MAXCOM ranks 382 (70.87%) of them at the top against randomly constructed protein complexes. Permutation experiments further confirm that MAXCOM is robust to the network structure and the parameters involved. We further analyze the protein complexes ranked among the top ten for breast cancer and show that the SWI/SNF complex is potentially associated with breast cancer.

Conclusions

MAXCOM is an effective method for discovering disease-related protein complexes based on network optimization. The high performance and robustness of this approach can facilitate not only pathological studies of diseases but also the design of drugs that target multiple proteins.

Keywords:
Complex Disease; Protein Complex; Genomic Data Integration; Network Optimization

Background

Protein complexes are essential cellular functional units in which several proteins work together as parts of an assembly. The functionality of a protein complex depends on interactions among its member proteins, which are typically densely connected in a protein-protein interaction (PPI) network, reflecting the modular organization of the network. In pathogenic conditions, the dysfunction of complex members usually affects the function of the entire complex [1-3]. Although systematic genetic and epigenetic analyses of human inherited diseases have revealed numerous SNPs [4-9], miRNAs [10], long noncoding RNAs [11], individual disease proteins [12] and epigenetic modifications [13], functional associations between diseases and protein complexes still lack systematic investigation.

Protein complexes have been shown, both experimentally and computationally, to be associated with many diseases. For example, different mutations in the SWI/SNF chromatin remodelling complex have been reported to cause Coffin-Siris syndrome [14,15], Nicolaides-Baraitser syndrome [16], and cancers [17,18]. Aberrations in mitochondrial complex-I NADH dehydrogenase activity can profoundly enhance the aggressiveness of human breast cancer cells, while therapeutic normalization of the NAD+/NADH balance can inhibit metastasis and prevent disease progression [19]. mTOR complex 1 plays a critical role in hematopoiesis and Pten-loss-evoked leukemogenesis [20]. In recent years, several system-level maps of protein complexes have been constructed in yeast [21-23], Drosophila melanogaster [24] and human [25], representing significant efforts towards a comprehensive understanding of protein complexes. Effective use of these large-scale data has proved valuable in analyzing individual disease proteins and related complexes. For example, Lage et al. prioritized disease proteins based on a systematic analysis of human protein complexes comprising gene products implicated in many different categories of human disease [26]. Vanunu et al. provided a global network-based method for prioritizing disease proteins and inferring protein complex associations with a disease of interest [27]. Yang et al. proposed a technique for predicting disease proteins based on a constructed protein complex network [28]. Although these studies, together with earlier studies of individual disease proteins [29-36], have achieved remarkable success, the large-scale prediction and mechanistic explanation of disease-related complexes remain an open question.
Considering that functional units are often protein complexes rather than individual proteins, we focus on disease-related complexes rather than disease-related proteins, aiming at a higher-level investigation that may be one step closer to biological reality.

To this end, we propose in this paper a computational method, MAXCOM, to prioritize candidate protein complexes. To optimize the relationship between a query disease and a protein complex, the maximum information flow (MIF) between them is calculated through a heterogeneous network constructed from protein-protein interactions and disease phenotypic similarities. MAXCOM then prioritizes all candidate complexes by ranking their MIFs. We test, in a cross-validation setting, the utility of MAXCOM in prioritizing protein complexes with at least one known disease gene. The results show that MAXCOM ranks a high proportion of complexes at the top against a large number of randomly constructed negative controls. We also demonstrate the power of MAXCOM by studying the association between breast cancer and the SWI/SNF complex. We believe that our method and predictions provide a useful platform for initial investigations of how protein complexes link their actions to development and homeostasis in human diseases.

Materials and methods

Workflow of MAXCOM

The prioritization of protein complexes is modelled as an optimization problem, in which the objective is to find the maximum information flow between a query disease and a candidate complex through a heterogeneous network. MAXCOM takes several steps to prioritize all candidate complexes for a query disease (Figure 1). First, a heterogeneous network is constructed from disease phenotypic similarities, disease-gene associations and protein-protein interactions. The nodes of the network are either diseases or proteins, and the capacities of the edges are given by the phenotypic similarities among diseases or the interactions among proteins. Second, to describe the relationship between a query disease and a protein complex, we add an extra sink node together with an edge from each member of the complex to the sink. Third, by calculating the maximum flow from the query disease to this sink, we obtain the maximum information flow (MIF) from the query disease through the nodes of the complex (Figure 1A). The MIFs of all candidate protein complexes are calculated, and the complexes are then ranked accordingly (Figure 1B). In the following, we describe the construction of the heterogeneous network and the calculation of the maximum information flows of candidate complexes.

Figure 1. Workflow of MAXCOM. A. A heterogeneous network is constructed by combining the disease similarity network, disease-gene associations and the protein-protein interaction (PPI) network. For a query disease and a set of candidate protein complexes, MAXCOM applies a maximum flow algorithm to calculate the maximum information flow (MIF) from the query to each complex. The MIF of the i-th complex is defined as $\mathrm{MIF}_i = \sum_{j=1}^{n_i} f_j$, where $n_i$ is the number of proteins in complex $i$ and $f_j$ is the flow value on the edge from the j-th protein to the sink node. B. Candidate complexes are ranked by their MIFs.

Construction of heterogeneous network

The heterogeneous network is composed of disease phenotypic similarities, disease-protein associations and protein-protein interactions. The phenotypic similarities were taken from the literature [37] and include pairwise similarities for 5,080 diseases. Each similarity ranges from 0 to 1, where a larger value means higher phenotypic similarity between a disease pair. The PPI network was extracted from the Human Protein Reference Database (HPRD, released in February 2013) [38] and includes 9,998 proteins and 41,049 interactions. The disease-protein associations were extracted from the Ensembl database by using the BioMart tool [39]. Focusing on the 5,080 diseases and the proteins that can be mapped back to the HPRD database, we obtain a total of 1,962 associations between 1,548 diseases and 1,244 proteins. When constructing the heterogeneous network, all 5,080 diseases and 9,998 proteins are taken as nodes. The edges consist of the 41,049 interactions between proteins, the 1,962 disease-protein associations and the edges between disease pairs with nonzero similarities. To filter out small similarities, which indicate low confidence in the relationship between a disease pair, we introduce a threshold parameter and remove the disease-disease edges whose similarities are below it. The threshold is set to 0.1, the mean of all disease similarities. Existing studies have shown that relationships between diseases are noisy [37], and a noise filtering step therefore helps to improve the performance of disease gene detection [33]. Finally, we obtain a heterogeneous network with 15,078 nodes and 5,782,818 edges.
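The noise filtering step can be illustrated with a minimal Python sketch. The function name, the dictionary layout and the toy similarity values are assumptions made for illustration only; the value 0.1 is the threshold reported for the main experiments in the Results section.

```python
def filter_disease_edges(similarities, threshold):
    """Drop disease pairs whose phenotypic similarity is below `threshold`."""
    return {pair: s for pair, s in similarities.items() if s >= threshold}


# Made-up similarities between hypothetical diseases D1..D3.
toy_similarities = {("D1", "D2"): 0.45, ("D1", "D3"): 0.04, ("D2", "D3"): 0.25}
kept = filter_disease_edges(toy_similarities, threshold=0.1)
print(kept)
```

Only the pair with similarity 0.04 is removed here; raising the threshold makes the disease layer of the network sparser.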

To optimize the relationship between a query disease and a complex, we model it as the MIF from the query disease node to the sink through all member proteins of the complex (Figure 1A). Here the heterogeneous network serves as a functional network that links diseases and proteins, and the MIF measures the strength of the functional relationship between a query disease and a candidate complex. Intuitively, if the query disease has a stronger functional relationship with one candidate complex, the MIF between the disease and that complex will be larger than the MIFs between the disease and the other candidate complexes. For this model, each edge of the heterogeneous network is assigned a capacity, that is, an upper bound on the information flow it can carry. Specifically, the capacity of an edge between two diseases is set to their phenotypic similarity. The capacity of an edge between two proteins (a protein interaction) is set to 1. The capacity of an edge between a disease and a protein (a disease-protein association) is set to infinity. We also add an edge from each member protein of a complex to an additional sink node and set the capacities of these edges to infinity. Under this definition, the stronger the functional relationship between two nodes, the larger the capacity of the edge between them.
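The capacity rules above can be written down as a small Python sketch. It assumes that disease-disease, protein-protein and disease-protein edges are undirected (the text does not say so explicitly) and that edges into the sink are one-way; the function name, the `SINK` label and the toy identifiers are hypothetical.

```python
import math


def build_capacities(disease_sims, associations, interactions, complex_members, sink="SINK"):
    """Return {(u, v): capacity} following the capacity rules described above."""
    capacity = {}

    def add_both_directions(u, v, c):
        # Assumption: edges of the heterogeneous network are undirected.
        capacity[(u, v)] = c
        capacity[(v, u)] = c

    for (d1, d2), sim in disease_sims.items():
        add_both_directions(d1, d2, sim)                 # disease-disease: similarity
    for p1, p2 in interactions:
        add_both_directions(p1, p2, 1.0)                 # protein-protein: 1
    for disease, protein in associations:
        add_both_directions(disease, protein, math.inf)  # association: infinite
    for protein in complex_members:
        capacity[(protein, sink)] = math.inf             # member -> sink: infinite, one-way
    return capacity


# Toy network: query disease Q, similar disease D2, proteins P1 and P2.
caps = build_capacities(
    disease_sims={("Q", "D2"): 0.5},
    associations=[("D2", "P1")],
    interactions=[("P1", "P2")],
    complex_members=["P2"],
)
```

With these rules, the finite capacities sit on the disease similarity and protein interaction edges, so those edges are the ones that bound the flow from the query disease to the sink.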

Calculation of maximum information flow

For the heterogeneous network $G = (V, E, C)$, where $V$, $E$ and $C$ represent the nodes, the edges and the nonnegative capacity on each edge, respectively, the MIF from the query node to the sink through all the proteins of the complex is calculated in two steps. First, the MIF from the query node $s$ to the additional sink $t$ is calculated as follows.

$$\max \sum_{j:(s,j) \in E} f_{sj} \qquad (1)$$

$$\text{s.t.} \quad \sum_{i:(i,k) \in E} f_{ik} = \sum_{j:(k,j) \in E} f_{kj}, \quad k \in V \setminus \{s, t\}$$

$$0 \le f_{ij} \le c_{ij}, \quad (i,j) \in E$$

where the information flow $f_{ij}$ is defined as the flow value transmitted from node $i$ to node $j$, and $c_{ij}$ is the capacity of the edge linking nodes $i$ and $j$.

Second, the MIF from the query to the i-th complex is defined as $\mathrm{MIF}_i = \sum_{j=1}^{n_i} f_j$, where $n_i$ is the number of proteins in complex $i$ and $f_j$ is the flow value on the edge from the j-th protein to the sink node. We use the HR_PR algorithm [40] to solve problem (1). For all candidate complexes, the MIFs are then calculated and ranked.
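The two-step calculation can be sketched in Python on a toy network. This is not the solver used in the paper: a plain Edmonds-Karp routine is swapped in for the algorithm of [40] to keep the example self-contained. The sketch assumes undirected network edges, assumes that the MIF of a complex is the total flow arriving at the sink through its member proteins, and uses hypothetical node names and capacities.

```python
from collections import deque
import math


def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow on {(u, v): capacity}; returns (value, residual)."""
    residual, neighbours = {}, {}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0.0) + c
        residual.setdefault((v, u), 0.0)
        neighbours.setdefault(u, []).append(v)
        neighbours.setdefault(v, []).append(u)
    value = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in neighbours.get(u, []):
                if v not in parent and residual[(u, v)] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return value, residual
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[edge] for edge in path)
        if math.isinf(bottleneck):
            return math.inf, residual  # an all-infinite path: the flow is unbounded
        for u, v in path:
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
        value += bottleneck


def complex_mif(network, query, members, sink="SINK"):
    """Add infinite-capacity member -> sink edges, then return (MIF, per-member flows)."""
    capacity = dict(network)
    for protein in members:
        capacity[(protein, sink)] = math.inf
    value, residual = max_flow(capacity, query, sink)
    # The flow on edge (protein, sink) equals the residual on its reverse edge.
    flows = {protein: residual[(sink, protein)] for protein in members}
    return value, flows


def undirected(edges):
    """Expand {(u, v): c} so that each edge is present in both directions."""
    return {pair: c for (u, v), c in edges.items() for pair in ((u, v), (v, u))}


network = undirected({
    ("Q", "D2"): 0.5, ("Q", "D3"): 0.25,                      # disease similarities
    ("D2", "P1"): math.inf, ("D3", "P4"): math.inf,           # disease-protein associations
    ("P1", "P2"): 1.0, ("P1", "P3"): 1.0, ("P4", "P5"): 1.0,  # protein interactions
})
mif_a, flows_a = complex_mif(network, "Q", ["P2", "P3"])  # hypothetical complex A
mif_b, flows_b = complex_mif(network, "Q", ["P5"])        # hypothetical complex B
ranking = sorted({"A": mif_a, "B": mif_b}.items(), key=lambda kv: -kv[1])
print(ranking)
```

In this toy case complex A is reached through the more similar disease D2, so it receives the larger flow and is ranked first. A path made only of infinite-capacity edges (for example a direct association between the query disease and a member protein) would make the flow unbounded, which the sketch reports as infinity.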

Validation method and evaluation criteria

Leave-one-out cross-validation experiments are adopted to assess the capability of MAXCOM to identify protein complexes that are associated with human diseases. For this purpose, we define a protein complex to be associated with a disease if at least one member protein of the complex has been annotated as associated with the disease. After mapping onto the 5,080 diseases and 9,998 proteins, a total of 539 disease-related protein complexes are collected from the CORUM database (released in February 2013) [41]. In each validation run, a test protein complex (the positive control) is selected and all associations between the complex and diseases are deleted. The test protein complex is then ranked against a collection of negative control complexes. Two types of negative control complexes are used. First, 99 random protein complexes are generated, which we call random control protein complexes. Each of them contains the same number of proteins as the positive control, selected at random from the 9,998 proteins. Second, for a given protein complex, the remaining 538 protein complexes are taken as negative controls, which we call real control protein complexes for convenience.
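The construction of the random control protein complexes can be sketched as follows. The function name, the seed argument and the protein identifiers are hypothetical; the sketch assumes that proteins are drawn uniformly without replacement within each control complex.

```python
import random


def random_control_complexes(size, proteins, n_controls=99, seed=None):
    """Draw `n_controls` random protein sets, each with `size` distinct proteins."""
    rng = random.Random(seed)
    return [rng.sample(proteins, size) for _ in range(n_controls)]


# Hypothetical protein identifiers standing in for the 9,998 HPRD proteins.
all_proteins = [f"P{i}" for i in range(9998)]
controls = random_control_complexes(size=4, proteins=all_proteins, seed=0)
print(len(controls), controls[0])
```

Each validation run would then rank the positive control together with its 99 size-matched controls, giving a ranking list of length 100.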

Three criteria are used to quantify the performance of MAXCOM. First, if a positive control complex is ranked at the top in a validation run, the run is considered a successful prediction. We calculate the top ranked ratio (TOP) as the number of successful predictions divided by the number of validation runs. Second, we calculate the average rank of all positive controls and normalize it by the length of the ranking lists to obtain a mean rank ratio (MRR). Third, given a threshold of the relative rank, we calculate the sensitivity (true positive rate) as the fraction of test protein complexes ranked above the threshold and the specificity (true negative rate) as the fraction of control protein complexes ranked below the threshold. A rank receiver operating characteristic (ROC) curve is then drawn by varying the threshold value from 0 to 1, and the area under this curve (AUC) is calculated. A larger TOP and AUC, as well as a smaller MRR, indicate higher performance.
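The first two criteria can be computed from the rank of the positive control in each validation run, as in the following sketch. The function name and the toy ranks are made up, and the MRR is taken here as the mean of rank divided by list length, which is one reading of the definition above.

```python
def top_and_mrr(results):
    """`results` holds one (rank, list_length) pair per validation run; rank 1 is best."""
    n_runs = len(results)
    top = sum(1 for rank, _ in results if rank == 1) / n_runs
    mrr = sum(rank / length for rank, length in results) / n_runs
    return top, mrr


# Four made-up runs, each ranking one positive control among 100 complexes.
runs = [(1, 100), (1, 100), (5, 100), (20, 100)]
top, mrr = top_and_mrr(runs)
print(f"TOP = {top:.2%}, MRR = {mrr:.2%}")
```

Two of the four toy runs place the positive control first, so TOP is one half, while the MRR averages the four relative ranks.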

Results

Performance of MAXCOM

To examine how well MAXCOM prioritizes candidate protein complexes, we assessed its ability to recover 539 protein complexes with known disease proteins in leave-one-out cross-validation experiments. For each of these protein complexes, we first generated 99 randomly constructed complexes as negative controls. Counting the test protein complexes at each ranking position, we observed that 382 of the 539 test cases were ranked first, giving a TOP value of 70.87%. The mean rank ratio (MRR) was only 8.69%, and a total of 412 test cases were ranked in the top 5, showing that test complexes accumulate rapidly at the top of the ranking lists (Figure 2A). The area under the rank receiver operating characteristic curve (AUC) was as high as 91.33% (Figure 2B).

Figure 2. Performance of MAXCOM. Histograms of ranks on random control protein complexes (A) and real control protein complexes (C). Rank receiver operating characteristic (ROC) curves on random control protein complexes (B) and real control protein complexes (D). The results were obtained on the original network, on networks with 10% of edges deleted or added, and on a randomly permuted network with the same degree distribution, respectively.

To simulate the realistic scenario in which a user wants to pinpoint known complexes for further biological validation, we performed a cross-validation on all 539 disease-related complexes. With one complex selected as the positive control, the remaining 538 complexes were taken as negative controls. In this stricter setting, MAXCOM also showed a rapid accumulation of test complexes at the top of the ranking lists (Figure 2C). For example, it achieved a TOP value of 15.03%, and 30.61% of test cases were ranked in the top 5. Its MRR and AUC were 37.71% and 84.25%, respectively (Figure 2D). Although all three criteria were worse than on the random controls, the drop was reasonable because the negative control set was 5.43 (538/99) times larger than the set of random control protein complexes. Thus, MAXCOM also achieved acceptable performance in pinpointing real protein complexes from a set of disease-related complexes and was suitable for large-scale predictions.

Robustness to network structure

The robustness of MAXCOM to noise in biological networks is of great importance because noise is widely observed in existing biological data [42,43]. Such noise may introduce many false protein-protein interactions into the constructed network and reduce prediction precision. To examine this issue, we used several strategies to check the robustness of MAXCOM to the network structure on both types of control sets. First, we randomly deleted 10% of the edges of the heterogeneous network. On random control protein complexes, MAXCOM achieved a TOP of 69.02%, an MRR of 10.02% and an AUC of 90.12%. Compared with the original network, the changes were small: TOP decreased by 1.85 percentage points, MRR increased by 1.33 and AUC decreased by 1.21. On real control protein complexes, MAXCOM achieved a TOP of 12.62%, an MRR of 39.92% and an AUC of 80.42%. Here TOP decreased by 2.41 percentage points, MRR increased by 2.21 and AUC decreased by 3.83.

Second, we randomly added 10% more edges to the heterogeneous network. In this case, MAXCOM achieved a TOP of 70.5%, an MRR of 9.83% and an AUC of 90.16% on random control protein complexes. Compared with the original network, TOP decreased by 0.37 percentage points, MRR increased by 1.14 and AUC decreased by 1.17. On real control protein complexes, MAXCOM achieved a TOP of 12.8%, an MRR of 38.56% and an AUC of 82.02%. Here TOP decreased by 2.23 percentage points, MRR increased by 0.85 and AUC decreased by 2.23 (Figure 2B, D). These two perturbation experiments suggested that MAXCOM was effective in dealing with missing and false positive edges and was robust to changes in the network structure.

Third, validation experiments were also performed by shuffling the edges of the heterogeneous network while preserving the degree distribution (i.e., the number of neighbours of each node). For this permuted network, the AUC scores on both control sets were reduced to approximately 50%, with the result for the random control protein complexes slightly higher at 57.34% (Figure 2B, D). This validation further indicated that MAXCOM exploits the useful information in the heterogeneous network to prioritize disease-related protein complexes.
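The text does not specify how the edges were shuffled. One common way to randomize a network while keeping every node's degree fixed is repeated double-edge swapping, sketched below for an undirected, unweighted edge list. The function names and the toy ring network are assumptions for illustration, and edge weights and edge types are ignored.

```python
import random
from collections import Counter


def degrees(edges):
    """Count how many edges touch each node."""
    return Counter(node for edge in edges for node in edge)


def degree_preserving_shuffle(edges, n_swaps, seed=None):
    """Rewire (a, b), (c, d) into (a, d), (c, b) without changing any node degree."""
    rng = random.Random(seed)
    edge_list = [tuple(edge) for edge in edges]
    present = {frozenset(edge) for edge in edge_list}
    swaps, attempts = 0, 0
    while swaps < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if len({a, b, c, d}) < 4:
            continue  # the two edges share a node, so a swap could create a self-loop
        new_1, new_2 = frozenset((a, d)), frozenset((c, b))
        if new_1 in present or new_2 in present:
            continue  # the swap would duplicate an existing edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new_1, new_2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        swaps += 1
    return edge_list


# Toy network: a ring of 8 nodes, every node has degree 2.
ring = [(i, (i + 1) % 8) for i in range(8)]
shuffled = degree_preserving_shuffle(ring, n_swaps=10, seed=1)
print(shuffled)
```

After shuffling, each node still has the same number of neighbours, but which nodes are connected no longer reflects the original network, which is the property the permutation test relies on.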

Robustness to parameter

We introduced a threshold parameter to filter out potential noise in the disease similarities. In practice, this threshold plays an important role not only in filtering out low-confidence similarities between diseases to improve prediction precision but also in making the heterogeneous network sparser to reduce running time. Here we varied it in steps of 0.05 to test its effect on MAXCOM (Table 1). When no disease edges were cut off (threshold of 0), the TOP, MRR and AUC were 69.94%, 9.05% and 90.91%, respectively. As the threshold increased, the best performance was achieved at a threshold of 0.1, the setting used in the results above. With further increases of the threshold, most criteria decreased, especially the TOP.
Despite these changes, the relative changes in the three criteria were limited. For example, when the threshold changed from 0.1 to 0.4, the TOP changed from 70.87% to 55.29%, a relative change of 21.98%. The MRR changed from 8.69% to 7.46%, a relative change of 14.15%. Meanwhile, the AUC changed from 91.33% to 92.57%, a relative change of only 1.36%. These results showed that the threshold was useful for improving the precision of MAXCOM by filtering noise (compared with a threshold of 0), and indicated that MAXCOM was robust to changes in this parameter.
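The ratios quoted above are consistent with dividing the absolute change of each criterion by its value at the threshold of 0.1, which is how they appear to have been computed:

$$\frac{|70.87 - 55.29|}{70.87} \approx 21.98\%, \qquad \frac{|8.69 - 7.46|}{8.69} \approx 14.15\%, \qquad \frac{|91.33 - 92.57|}{91.33} \approx 1.36\%.$$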

Table 1. Robustness of MAXCOM with respect to the disease similarity threshold parameter.

The threshold parameter also affected the number of edges in the heterogeneous network. With a threshold of 0, there were 10,174,820 edges in total. The number decreased drastically to 5,782,818 (threshold of 0.1) and 154,692 (threshold of 0.4). Thus, as the threshold increased, MAXCOM ran much faster. For example, with a threshold of 0, the average computation time of each run was 2.86 seconds. It dropped to 1.57 and 0.18 seconds when the threshold was 0.1 and 0.4, respectively.
In summary, the threshold was useful for filtering out low-confidence similarities between diseases and beneficial for both the performance and the computation time of MAXCOM.

Prediction of protein complexes associated with breast cancer

To demonstrate the ability of MAXCOM to predict novel disease-related complexes, we performed a case study of breast cancer (OMIM 114480), one of the most commonly occurring cancers. We systematically examined the top ten complexes prioritized among the 539 candidates (Table 2). These ten complexes contained 58 proteins, including 6 (BRCA1, TP53, KRAS, ATM, CDH1, RAD51) of the 32 disease proteins reported in the OMIM database [44]. We first performed a functional enrichment analysis of these 58 proteins by using the DAVID database [45,46]. The results showed that these proteins were most strongly enriched in chromosome organization (p-value = 1.36e-15), chromatin modification/remodelling/organization (p-value = 7.32e-11) and protein complex biogenesis/assembly (p-value = 9.03e-10). This was consistent with the functional characterizations of the ten protein complexes manually annotated in the CORUM database [41] (Table 2). Apart from the known breast cancer disease proteins found in these protein complexes, the complexes contained many proteins associated with other types of diseases, for example E2F4, E2F5, HRAS, JUN and FOS. We also found that several proteins (CDH1, CTNNB1, SMAD3, SMAD4, SMARCA4, SMARCC1, SMARCC2) were shared by multiple complexes and that all of these complexes were connected by many protein-protein interactions (Figure 3), suggesting tight functional relationships among them. These results indicated that these complexes might act as a large functional module involved in different stages of breast cancer.

Table 2. Predicted top ten protein complexes of breast cancer.

Figure 3. Interactions among the ten predicted protein complexes of breast cancer. The interactions are shown for the 58 proteins of the ten complexes. Six known genes associated with breast cancer are shown in red (CDH1, KRAS, BRCA1, ATM, RAD51, TP53). All ten complexes are connected by protein-protein interactions (blue lines).

We then analyzed in detail the PBAF complex (SWI/SNF complex), since it did not include any known disease proteins of breast cancer according to the OMIM database (as of Aug. 20, 2013) and was ranked last among the ten analyzed complexes. The SWI/SNF complex is a multi-subunit chromatin-remodelling complex that mobilizes nucleosomes and remodels chromatin, playing key roles in the control of lineage specification, gene expression and repression, metastasis and epigenetic tumor suppression. We found numerous reports in the literature that the SWI/SNF complex is associated with a variety of cancers, including breast cancer. As inactivating mutations in several SWI/SNF subunits have recently been identified at high frequency in a variety of cancers, a widespread role in tumour suppression has been proposed for the SWI/SNF complex [17,47,48]. Indeed, SWI/SNF has been shown to be the most frequently mutated chromatin-regulatory complex in human cancer, exhibiting a broad mutation pattern similar to that of TP53 [18]. Here SWI/SNF was ranked among the top ten, which predicts it as one of the protein complexes potentially involved in breast cancer. In summary, the ten proposed protein complexes are potentially involved in basic biological functions and agree well with current knowledge of breast cancer.

Discussion

With the explosion of large-scale "omics" data, computational methods that integrate these complex heterogeneous data can provide a more thorough and systematic analysis for characterizing disease-related factors. Here we have proposed a network-based strategy to prioritize candidate protein complexes by integrating disease phenotypic similarities and protein-protein interactions. As the validation results show, MAXCOM is useful in tracing relationships between diseases and complexes through the heterogeneous network. Compared with earlier work on prioritizing individual disease proteins [12,29,30], our work presents a computational tool to analyze disease-related factors at a higher functional level, one step closer to the mechanisms underlying diseases.

Although MAXCOM has proved useful, some methodological improvements may be necessary in further research. An important extension is to account for tissue specificity. Since different cells have specific cellular functions such as regulation and expression [49] and splicing and methylation [50], human PPIs and protein complexes have been observed in tissue-specific contexts [51]. By utilizing these tissue-specific protein interactions, we may analyze protein complexes in relation to tissue-specific diseases. Another extension is to consider the "edge prioritization" suggested in earlier literature [12,52]. Instead of prioritizing proteins or protein complexes only in isolation, more attention should also be devoted to potential interactions among top candidates. Here, we have shown that the top ten ranked protein complexes are functionally associated; however, a more comprehensive and systematic analysis of these top-ranked candidates is desirable. This is especially important for subsequent experimental validation, since correlations among top-ranked protein complexes may indicate temporal and spatial cellular relationships. Third, noise filtering remains to be addressed. Considering that biological data are far from complete and full of noise, it is useful to improve precision by filtering noise before data integration. There are two ways to achieve this. One is to filter low-confidence data by parameters, as in our study; the other is to integrate more relevant types of biological information. For example, relationships among proteins can be described by many types of evidence, such as co-expression, shared functional annotations, co-occurrence in the literature and co-regulation [29,53-55]. These highly heterogeneous data contribute not only to inferring stronger relationships through the accumulation of evidence, but also to providing broader coverage than any single data source.
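The two noise-handling routes described above can be sketched as follows in Python. The first function drops edges below a confidence cutoff; the second merges several evidence sources with a noisy-OR rule, so that agreeing sources reinforce one another and the merged network covers more pairs than any single source. The noisy-OR combination, the function names, the cutoff and all scores are assumptions made for illustration and are not the scheme used by MAXCOM.

```python
def filter_edges(edges, threshold):
    """Keep only edges whose confidence score reaches the threshold."""
    return {pair: score for pair, score in edges.items() if score >= threshold}


def combine_evidence(sources):
    """Merge edge scores from several sources with a noisy-OR rule.

    Each source maps a protein pair (u, v) to a score in [0, 1].
    Pairs are treated as undirected, so (u, v) and (v, u) are merged.
    """
    combined = {}
    for source in sources:
        for pair, score in source.items():
            key = frozenset(pair)
            previous = combined.get(key, 0.0)
            # Noisy-OR: the edge is missed only if every source misses it.
            combined[key] = 1.0 - (1.0 - previous) * (1.0 - score)
    return combined


# Hypothetical scores for two evidence types (placeholder identifiers).
ppi = {("P1", "P2"): 0.9, ("P2", "P3"): 0.2}
coexpr = {("P2", "P1"): 0.5, ("P3", "P4"): 0.6}

high_confidence = filter_edges(ppi, threshold=0.5)
combined = combine_evidence([ppi, coexpr])

print(high_confidence)
for pair, score in combined.items():
    print(sorted(pair), round(score, 3))
```

The cutoff trades coverage for precision, whereas the combination step recovers weak edges only when independent evidence supports them; which balance is appropriate would depend on the data sources at hand.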

Finally, MAXCOM could potentially be applied to find combinatorial protein targets and thereby help design network drugs. Here a disease is considered as a perturbation of the complex intracellular and intercellular network that links tissue and organ systems [56]. The ability to explore the molecular complexity of a particular disease at the protein complex level will lead to the identification of molecular relationships among distinct phenotypes. Thus, systematically predicting and analyzing disease-associated protein complexes could be useful for investigating the mechanisms underlying diseases, and could help to identify combinatorial drug targets and biomarkers.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

RJ provided guidance and planning for the project. YC produced the program and wrote the manuscript, particularly the results section. YC, TJ and SZ contributed to data preparation and analysis of the results. All authors read and approved the final manuscript.

Acknowledgements

This work was partly supported by the National Basic Research Program of China (2012CB316504), the National High Technology Research and Development Program of China (2012AA020401), the National Natural Science Foundation of China (61175002, 60928007, and 61273228), the Open Research Fund of Shandong Provincial Key Laboratory of Network based Intelligent Computing, and the Open Research Fund of State Key Laboratory of Bioelectronics, Southeast University.

Declarations

Publication of this article was funded by the corresponding author.

This article has been published as part of BMC Systems Biology Volume 8 Supplement 1, 2014: Selected articles from the Twelfth Asia Pacific Bioinformatics Conference (APBC 2014): Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/8/S1.

References

  1. Schadt EE: Molecular networks as sensors and drivers of common human diseases.

    Nature 2009, 461(7261):218-223.

  2. Zhao J, Lee SH, Huss M, Holme P: The network organization of cancer-associated protein complexes in human tissues.

    Scientific reports 2013, 3:1583.

  3. Chairerg P, Tantavisut S, Tanavalee A, Tuangjaruwinai W, Panchaprateep R, Asawanonda P: Cast application of four weeks' duration significantly affects hair length, diameter and density.

    The Journal of dermatological treatment 2013.

  4. Jiang R, Yang H, Sun F, Chen T: Searching for interpretable rules for disease mutations: a simulated annealing bump hunting strategy.

    BMC bioinformatics 2006, 7:417.

  5. Jiang R, Yang H, Zhou L, Kuo CC, Sun F, Chen T: Sequence-based prioritization of nonsynonymous single-nucleotide polymorphisms for the study of disease mutations.

    American journal of human genetics 2007, 81(2):346-360.

  6. Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, et al.: A second generation human haplotype map of over 3.1 million SNPs.

    Nature 2007, 449(7164):851-861.

  7. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA: Potential etiologic and functional implications of genome-wide association loci for human diseases and traits.

    Proceedings of the National Academy of Sciences of the United States of America 2009, 106(23):9362-9367.

  8. Tang W, Wu X, Jiang R, Li Y: Epistatic module detection for case-control studies: a Bayesian model with a Gibbs sampling strategy.

    PLoS genetics 2009, 5(5):e1000464.

  9. Jiang R, Tang W, Wu X, Fu W: A random forest approach to the detection of epistatic interactions in case-control studies.

    BMC bioinformatics 2009, 10(Suppl 1):S65.

  10. Calin GA, Croce CM: MicroRNA signatures in human cancers.

    Nature reviews Cancer 2006, 6(11):857-866.

  11. Cheetham SW, Gruhl F, Mattick JS, Dinger ME: Long noncoding RNAs and the genetics of cancer.

    British journal of cancer 2013, 108(12):2419-2425.

  12. Moreau Y, Tranchevent LC: Computational tools for prioritizing candidate genes: boosting disease gene discovery.

    Nature reviews Genetics 2012, 13(8):523-536.

  13. Portela A, Esteller M: Epigenetic modifications and human disease.

    Nature biotechnology 2010, 28(10):1057-1068.

  14. Santen GW, Aten E, Sun Y, Almomani R, Gilissen C, Nielsen M, Kant SG, Snoeck IN, Peeters EA, Hilhorst-Hofstee Y, et al.: Mutations in SWI/SNF chromatin remodeling complex gene ARID1B cause Coffin-Siris syndrome.

    Nature genetics 2012, 44(4):379-380.

  15. Tsurusaki Y, Okamoto N, Ohashi H, Kosho T, Imai Y, Hibi-Ko Y, Kaname T, Naritomi K, Kawame H, Wakui K, et al.: Mutations affecting components of the SWI/SNF complex cause Coffin-Siris syndrome.

    Nature genetics 2012, 44(4):376-378.

  16. Van Houdt JK, Nowakowska BA, Sousa SB, van Schaik BD, Seuntjens E, Avonce N, Sifrim A, Abdul-Rahman OA, van den Boogaard MJ, Bottani A, et al.: Heterozygous missense mutations in SMARCA2 cause Nicolaides-Baraitser syndrome.

    Nature genetics 2012, 44(4):445-449, S441.

  17. Wilson BG, Roberts CW: SWI/SNF nucleosome remodellers and cancer.

    Nature reviews Cancer 2011, 11(7):481-492.

  18. Kadoch C, Hargreaves DC, Hodges C, Elias L, Ho L, Ranish J, Crabtree GR: Proteomic and bioinformatic analysis of mammalian SWI/SNF complexes identifies extensive roles in human malignancy.

    Nature genetics 2013, 45(6):592-601.

  19. Santidrian AF, Matsuno-Yagi A, Ritland M, Seo BB, LeBoeuf SE, Gay LJ, Yagi T, Felding-Habermann B: Mitochondrial complex I activity and NAD+/NADH balance regulate breast cancer progression.

    The Journal of clinical investigation 2013, 123(3):1068-1081.

  20. Kalaitzidis D, Sykes SM, Wang Z, Punt N, Tang Y, Ragu C, Sinha AU, Lane SW, Souza AL, Clish CB, et al.: mTOR complex 1 plays critical roles in hematopoiesis and Pten-loss-evoked leukemogenesis.

    Cell stem cell 2012, 11(3):429-439.

  21. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, et al.: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae.

    Nature 2006, 440(7084):637-643.

  22. Babu M, Vlasblom J, Pu S, Guo X, Graham C, Bean BD, Burston HE, Vizeacoumar FJ, Snider J, Phanse S, et al.: Interaction landscape of membrane-protein complexes in Saccharomyces cerevisiae.

    Nature 2012, 489(7417):585-589.

  23. Michaut M, Baryshnikova A, Costanzo M, Myers CL, Andrews BJ, Boone C, Bader GD: Protein complexes are central in the yeast genetic landscape.

    PLoS computational biology 2011, 7(2):e1001092.

  24. Guruharsha KG, Rual JF, Zhai B, Mintseris J, Vaidya P, Vaidya N, Beekman C, Wong C, Rhee DY, Cenaj O, et al.: A protein complex network of Drosophila melanogaster.

    Cell 2011, 147(3):690-703.

  25. Kikugawa S, Nishikata K, Murakami K, Sato Y, Suzuki M, Altaf-Ul-Amin M, Kanaya S, Imanishi T: PCDq: human protein complex database with quality index which summarizes different levels of evidences of protein complexes predicted from h-invitational protein-protein interactions integrative dataset.

    BMC systems biology 2012, 6(Suppl 2):S7.

  26. Lage K, Karlberg EO, Storling ZM, Olason PI, Pedersen AG, Rigina O, Hinsby AM, Tumer Z, Pociot F, Tommerup N, et al.: A human phenome-interactome network of protein complexes implicated in genetic disorders.

    Nature biotechnology 2007, 25(3):309-316.

  27. Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R: Associating genes and protein complexes with disease via network propagation.

    PLoS computational biology 2010, 6(1):e1000641.

  28. Yang P, Li X, Wu M, Kwoh CK, Ng SK: Inferring gene-phenotype associations via global protein complex network propagation.

    PLoS One 2011, 6(7):e21502.

  29. Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, Tranchevent LC, De Moor B, Marynen P, Hassan B, et al.: Gene prioritization through genomic data fusion.

    Nature biotechnology 2006, 24(5):537-544.

  30. Wu X, Jiang R, Zhang MQ, Li S: Network-based global inference of human disease genes.

    Molecular systems biology 2008, 4:189.

  31. Wu X, Liu Q, Jiang R: Align human interactome with phenome to identify causative genes and networks underlying disease families.

    Bioinformatics 2009, 25(1):98-104.

  32. Wang W, Zhang W, Jiang R, Luan Y: Prioritisation of associations between protein domains and complex diseases using domain-domain interaction networks.

    IET systems biology 2010, 4(3):212-222.

  33. Chen Y, Jiang T, Jiang R: Uncover disease genes by maximizing information flow in the phenome-interactome network.

    Bioinformatics 2011, 27(13):i167-176.

  34. Zhang W, Sun F, Jiang R: Integrating multiple protein-protein interaction networks to prioritize disease genes: a Bayesian regression approach.

    BMC bioinformatics 2011, 12(Suppl 1):S11.

  35. Zhang W, Chen Y, Sun F, Jiang R: DomainRBF: a Bayesian regression approach to the prioritization of candidate domains for complex diseases.

    BMC systems biology 2011, 5:55.

  36. Jiang R, Gan M, He P: Constructing a gene semantic similarity network for the inference of disease genes.

    BMC systems biology 2011, 5(Suppl 2):S2.

  37. van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA: A text-mining analysis of the human phenome.

    European journal of human genetics: EJHG 2006, 14(5):535-542.

  38. Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, et al.: Human Protein Reference Database--2009 update.

    Nucleic acids research 2009, 37(Database):D767-772.

  39. Smedley D, Haider S, Ballester B, Holland R, London D, Thorisson G, Kasprzyk A: BioMart--biological queries made easy.

    BMC genomics 2009, 10:22.

  40. Goldberg AV, Rao S: Beyond the flow decomposition barrier.

    Journal of the ACM (JACM) 1998, 45(5):783-797.

  41. Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes HW: CORUM: the comprehensive resource of mammalian protein complexes--2009.

    Nucleic acids research 2010, 38(Database):D497-501.

  42. Pilpel Y: Noise in biological systems: pros, cons, and mechanisms of control.

    Methods Mol Biol 2011, 759:407-425.

  43. Ladbury JE, Arold ST: Noise in cellular signaling pathways: causes and effects.

    Trends in biochemical sciences 2012, 37(5):173-178.

  44. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders.

    Nucleic acids research 2005, 33(Database):D514-517.

  45. Huang da W, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources.

    Nature protocols 2009, 4(1):44-57.

  46. Huang da W, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists.

    Nucleic acids research 2009, 37(1):1-13.

  47. Roberts CW, Orkin SH: The SWI/SNF complex--chromatin and cancer.

    Nature reviews Cancer 2004, 4(2):133-142.

  48. Euskirchen G, Auerbach RK, Snyder M: SWI/SNF chromatin-remodeling factors: multiscale analyses and diverse functions.

    The Journal of biological chemistry 2012, 287(37):30897-30905.

  49. Ong CT, Corces VG: Enhancer function: new insights into the regulation of tissue-specific gene expression.

    Nature reviews Genetics 2011, 12(4):283-293.

  50. Wan J, Oliver VF, Zhu H, Zack DJ, Qian J, Merbs SL: Integrative analysis of tissue-specific methylation and alternative splicing identifies conserved transcription factor binding motifs.

    Nucleic acids research 2013.

  51. Ellis JD, Barrios-Rodiles M, Colak R, Irimia M, Kim T, Calarco JA, Wang X, Pan Q, O'Hanlon D, Kim PM, et al.: Tissue-specific alternative splicing remodels protein-protein interaction networks.

    Molecular cell 2012, 46(6):884-892.

  52. Zhong Q, Simonis N, Li QR, Charloteaux B, Heuze F, Klitgord N, Tam S, Yu H, Venkatesan K, Mou D, et al.: Edgetic perturbation models of human inherited disorders.

    Molecular systems biology 2009, 5:321.

  53. Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules.

    Science 2003, 302(5643):249-255.

  54. Ma X, Lee H, Wang L, Sun F: CGI: a new approach for prioritizing genes by combining gene expression and protein-protein interaction data.

    Bioinformatics 2007, 23(2):215-221.

  55. Jenssen TK, Laegreid A, Komorowski J, Hovig E: A literature network of human genes for high-throughput analysis of gene expression.

    Nature genetics 2001, 28(1):21-28.

  56. Barabasi AL, Gulbahce N, Loscalzo J: Network medicine: a network-based approach to human disease.

    Nature reviews Genetics 2011, 12(1):56-68.