Methodology article

Correlation set analysis: detecting active regulators in disease populations using prior causal knowledge

Chia-Ling Huang1, John Lamb3, Leonid Chindelevitch2, Jarek Kostrowicki3, Justin Guinney4, Charles DeLisi1 and Daniel Ziemek2*

Author Affiliations

1 Bioinformatics Graduate Program, and Department of Biomedical Engineering, Boston University, 44 Cummington Street, Boston, MA 02215, USA

2 Computational Sciences Center of Emphasis, Worldwide Research & Development, Pfizer, 35 Cambridgepark Drive, Cambridge, MA 02140, USA

3 Oncology Research Unit, Worldwide Research & Development, Pfizer, 10646 Science Center Drive, San Diego, CA 92121, USA

4 Sage Bionetworks, 1100 Fairview Ave North, Seattle, WA 98109, USA


BMC Bioinformatics 2012, 13:46  doi:10.1186/1471-2105-13-46

Published: 23 March 2012



Identifying active causal regulators is a crucial problem in understanding disease mechanisms or finding drug targets. Methods that infer causal regulators directly from primary data have been proposed and successfully validated in some cases. However, these methods necessarily require very large sample sizes or a mix of different data types. Recent studies have shown that prior biological knowledge can boost a method's ability to find regulators.


We present a simple data-driven method, Correlation Set Analysis (CSA), for comprehensively detecting active regulators in disease populations by integrating co-expression analysis with a specific type of literature-derived causal relationship. Instead of investigating the co-expression level between regulators and their regulatees, we focus on the coherence among the regulatees of each regulator. Using simulated datasets, we show that our method performs very well at recovering even weak regulatory relationships with a low false discovery rate. Using three separate real biological datasets, we recovered both well-known and as yet undescribed active regulators for each disease population. The results are presented as a rank-ordered list of regulators and reveal both single and higher-order regulatory relationships.
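The idea of scoring a regulator by the coherence of its regulatees can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the scoring statistic, so the mean absolute pairwise Pearson correlation, the random-gene-set null model, and the names `coherence_score` and `rank_regulators` are all assumptions made for illustration.

```python
import numpy as np


def coherence_score(expr, gene_idx):
    """Mean absolute pairwise Pearson correlation among the selected gene rows.

    expr: array of shape (n_genes, n_samples); gene_idx: row indices of one
    regulator's regulatees. The statistic itself is an assumed stand-in.
    """
    corr = np.corrcoef(expr[np.asarray(gene_idx)])      # rows are variables
    upper = np.triu_indices(corr.shape[0], k=1)         # each pair once
    return float(np.mean(np.abs(corr[upper])))


def rank_regulators(expr, regulatee_sets, n_perm=200, seed=0):
    """Rank regulators by an empirical p-value for regulatee coherence.

    regulatee_sets maps a regulator name to the row indices of its
    regulatees (hypothetical input format). The null distribution uses
    random gene sets of the same size, which is an assumed null model.
    """
    rng = np.random.default_rng(seed)
    n_genes = expr.shape[0]
    results = []
    for regulator, idx in regulatee_sets.items():
        idx = np.asarray(idx)
        observed = coherence_score(expr, idx)
        null = np.array([
            coherence_score(expr, rng.choice(n_genes, size=len(idx), replace=False))
            for _ in range(n_perm)
        ])
        # Add-one correction keeps the empirical p-value strictly positive.
        p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
        results.append((regulator, observed, float(p_value)))
    # Most significant first; ties broken by higher observed coherence.
    return sorted(results, key=lambda r: (r[2], -r[1]))
```

Note that the regulator's own expression profile never enters the score; only the co-expression among its regulatees does, which mirrors the focus described above. The output is a rank-ordered list of `(regulator, score, p_value)` tuples.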


CSA is an intuitive, data-driven way of selecting directed perturbation experiments that are relevant to a disease population of interest and represent a starting point for further investigation. Our findings demonstrate that combining co-expression analysis on regulatee sets with a literature-derived network can successfully identify causal regulators and help develop possible hypotheses to explain disease progression.