Research article

Inferring causal genomic alterations in breast cancer using gene expression data

Linh M Tran1,2, Bin Zhang1,2*, Zhan Zhang2, Chunsheng Zhang2, Tao Xie2, John R Lamb2, Hongyue Dai2, Eric E Schadt1,2,3* and Jun Zhu1,2*

Author Affiliations

1 Sage Bionetworks, Seattle, WA 98109, USA

2 Merck Research Laboratories, Merck & Co., Inc., 33 Avenue Louis Pasteur, Boston, MA 02115, USA

3 Pacific Biosciences, 1505 Adams Drive, Menlo Park, California 94025, USA

BMC Systems Biology 2011, 5:121  doi:10.1186/1752-0509-5-121

Published: 1 August 2011



One of the primary objectives in cancer research is to identify causal genomic alterations, such as somatic copy number variation (CNV) and somatic mutations, during tumor development. Many valuable studies lack genomic data to detect CNV; therefore, methods that are able to infer CNVs from gene expression data would help maximize the value of these studies.


We developed a framework for identifying recurrent regions of CNV and distinguishing the cancer driver genes from the passenger genes in those regions. By inferring CNV regions across many datasets, we identified 109 recurrent amplified/deleted CNV regions. Many of these regions are enriched for genes involved in important processes associated with tumorigenesis and cancer progression. Genes in these recurrent CNV regions were then examined in the context of gene regulatory networks to prioritize putative cancer driver genes. The cancer driver genes uncovered by the framework include not only well-known oncogenes but also a number of novel cancer susceptibility genes validated via siRNA experiments.


To our knowledge, this is the first effort to systematically identify and validate drivers for expression-based CNV regions in breast cancer. The framework, in which wavelet analysis of expression-based copy number alteration is coupled with gene regulatory network analysis, provides a blueprint for leveraging genomic data to identify key regulatory components and gene targets. This integrative approach can be applied to many other large-scale gene expression studies and to other novel types of cancer data, such as next-generation sequencing-based expression (RNA-Seq) and CNV data.
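The idea of reading copy number changes from expression with a wavelet transform can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a generic Haar wavelet with hard thresholding, centered log-ratio expression values ordered by chromosomal position, and an input length that is a power of two. The function names (`haar_forward`, `haar_inverse`, `smooth_expression`, `call_regions`), the threshold, the cutoff, and the toy data are all hypothetical choices made for illustration.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_forward(signal):
    """Full Haar decomposition; returns (approximation, detail levels, finest first)."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of genes")
    approx, details = list(signal), []
    while len(approx) > 1:
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(a - b) / SQRT2 for a, b in pairs])
        approx = [(a + b) / SQRT2 for a, b in pairs]
    return approx, details

def haar_inverse(approx, details):
    """Rebuild the signal from the coarsest approximation and the detail levels."""
    approx = list(approx)
    for level in reversed(details):
        rebuilt = []
        for s, d in zip(approx, level):
            rebuilt.extend(((s + d) / SQRT2, (s - d) / SQRT2))
        approx = rebuilt
    return approx

def smooth_expression(values, threshold):
    """Hard-threshold small detail coefficients to suppress gene-level noise."""
    approx, details = haar_forward(values)
    kept = [[d if abs(d) >= threshold else 0.0 for d in level] for level in details]
    return haar_inverse(approx, kept)

def call_regions(smoothed, cutoff):
    """Return contiguous runs as (start, end_exclusive, 'gain' or 'loss')."""
    regions, start, label = [], None, None
    for i, v in enumerate(list(smoothed) + [0.0]):  # sentinel closes the last run
        current = "gain" if v > cutoff else "loss" if v < -cutoff else None
        if current != label:
            if label is not None:
                regions.append((start, i, label))
            start, label = i, current
    return regions

# Toy data: 16 genes in chromosomal order, with genes 4-7 consistently over-expressed.
values = [0.1, -0.1, 0.05, -0.05, 1.1, 0.9, 1.0, 1.0,
          0.0, 0.1, -0.1, 0.0, 0.05, -0.05, 0.1, -0.1]
smoothed = smooth_expression(values, threshold=0.5)
print(call_regions(smoothed, cutoff=0.5))  # [(4, 8, 'gain')]
```

In this toy case the small gene-to-gene fluctuations are removed and only the block of four neighboring over-expressed genes survives as a candidate amplified region. Real data would need per-gene normalization, handling of chromosome boundaries, and a principled choice of threshold, none of which this sketch attempts.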

Keywords: breast cancer; copy number variation; gene regulatory networks; oncogenes