Methodology article

Identifying pathogenic processes by integrating microarray data with prior knowledge

Ståle Nygård (1,2,3)*, Trond Reitan (4), Trevor Clancy (5), Vegard Nygaard (5), Johannes Bjørnstad (3,6), Biljana Skrbic (3,6), Theis Tønnessen (3,6), Geir Christensen (2,3) and Eivind Hovig (5,7,8)

Author Affiliations

1 Bioinformatics Core Facility, Institute for Medical Informatics, Oslo University Hospital, Oslo, Norway

2 Institute for Experimental Medical Research, Oslo University Hospital and University of Oslo, Oslo, Norway

3 KG Jebsen Cardiac Research Centre and Center for Heart Failure Research, University of Oslo, Oslo, Norway

4 Center for Ecological and Evolutionary Synthesis, Department of Biology, University of Oslo, Oslo, Norway

5 Department of Tumor Biology, Institute for Cancer Research, Oslo University Hospital, Oslo, Norway

6 Department of Cardiothoracic Surgery, Oslo University Hospital, Oslo, Norway

7 Institute for Medical Informatics, Oslo University Hospital, Oslo, Norway

8 Department of Informatics, University of Oslo, Oslo, Norway


BMC Bioinformatics 2014, 15:115  doi:10.1186/1471-2105-15-115

Published: 24 April 2014



Identifying the molecular processes and pathways involved in disease etiology is of great importance. Although various high-throughput methods have been used extensively for this task, pathogenic pathways are still not completely understood. Often the set of genes or proteins identified as altered in genome-wide screens shows poor overlap with canonical disease pathways. Such findings are difficult to interpret, yet they are crucial for improving the understanding of the molecular processes underlying disease progression. We present a novel method for identifying groups of connected molecules from a set of differentially expressed genes. These groups represent functional modules that share a common cellular function and involve signaling and regulatory events. Specifically, our method uses Bayesian statistics to identify groups of co-regulated genes based on the microarray data, where external information about molecular interactions and connections is used as a prior in the group assignments. Markov chain Monte Carlo sampling is used to search for the most reliable grouping.
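As a rough illustration of the idea described above, the following Python sketch samples group assignments with a Metropolis algorithm. It is a minimal toy under stated assumptions and not the model used in this article: the Gaussian within-group likelihood with plug-in group means, the Potts-style prior that rewards known interaction partners sharing a group, the fixed number of groups, and all names (`log_score`, `sample_grouping`, `sigma`, `beta`) and data values are hypothetical choices made for the example.

```python
import math
import random


def log_score(labels, data, edges, n_groups, sigma=1.0, beta=1.0):
    """Unnormalised log posterior of a grouping (toy model, an assumption)."""
    n_samples = len(data[0])
    loglik = 0.0
    for k in range(n_groups):
        members = [g for g, lab in enumerate(labels) if lab == k]
        if not members:
            continue
        for s in range(n_samples):
            # Plug-in group mean per sample; co-regulated genes lie close to it.
            mean = sum(data[g][s] for g in members) / len(members)
            loglik -= sum((data[g][s] - mean) ** 2 for g in members) / (2 * sigma ** 2)
    # Prior knowledge: each known interaction whose two genes share a group
    # adds beta to the log prior (Potts-style, an assumption).
    logprior = beta * sum(1 for a, b in edges if labels[a] == labels[b])
    return loglik + logprior


def sample_grouping(data, edges, n_groups, n_iter=5000, seed=0, **kw):
    """Metropolis sampler over group labels; returns the best grouping visited."""
    rng = random.Random(seed)
    labels = [rng.randrange(n_groups) for _ in data]
    current = log_score(labels, data, edges, n_groups, **kw)
    best, best_score = labels[:], current
    for _ in range(n_iter):
        g = rng.randrange(len(labels))
        old = labels[g]
        labels[g] = rng.randrange(n_groups)  # symmetric single-gene proposal
        proposed = log_score(labels, data, edges, n_groups, **kw)
        # Accept with probability min(1, exp(proposed - current)).
        if math.log(rng.random() + 1e-300) < proposed - current:
            current = proposed
            if current > best_score:
                best, best_score = labels[:], current
        else:
            labels[g] = old  # reject: restore the previous label
    return best


# Made-up expression values (rows: genes, columns: samples).
data = [
    [2.0, 0.1, -2.1],
    [1.8, -0.2, -1.9],
    [2.2, 0.0, -2.0],
    [-2.0, 0.2, 1.9],
    [-1.9, -0.1, 2.1],
    [-2.1, 0.0, 2.0],
]
# Made-up prior knowledge: pairs of genes with a known interaction.
edges = [(0, 1), (1, 2), (3, 4), (4, 5)]

groups = sample_grouping(data, edges, n_groups=2)
print(groups)
```

In this sketch `beta` controls how strongly the prior knowledge pulls interacting genes into the same group. With `beta=0` the sampler reduces to clustering on the expression data alone, which is the kind of baseline the prior is meant to improve on when few samples are available.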


Simulation results showed that the method improved the ability to identify the correct groups compared to traditional clustering, especially for small sample sizes. Applied to a microarray heart failure dataset, the method found one large cluster containing several genes important for the structure of the extracellular matrix and a smaller group with many genes involved in carbohydrate metabolism. The method was also applied to a microarray dataset on melanoma patients with or without metastasis, where the main cluster was dominated by genes related to keratinocyte differentiation.


Our method found clusters overlapping with known pathogenic processes, but also pointed to new connections extending beyond the classical pathways.