Open Access Research article

A model-based circular binary segmentation algorithm for the analysis of array CGH data

Fang-Han Hsu (1), Hung-I H Chen (2), Mong-Hsun Tsai (4), Liang-Chuan Lai (5), Chi-Cheng Huang (1,6), Shih-Hsin Tu (6), Eric Y Chuang (1)* and Yidong Chen (2,3)*

Author Affiliations

1 Graduate Institute of Biomedical Electronics and Bioinformatics, Department of Electrical Engineering, National Taiwan University, Taipei 106, Taiwan

2 Greehey Children's Cancer Research Institute, The University of Texas Health Science Center at San Antonio, San Antonio, TX 78229, USA

3 Department of Epidemiology and Biostatistics, The University of Texas Health Science Center at San Antonio, San Antonio, TX 78229, USA

4 Institute of Biotechnology, Center for Systems Biology and Bioinformatics, National Taiwan University, Taipei 106, Taiwan

5 Graduate Institute of Physiology, National Taiwan University, Taipei 100, Taiwan

6 Cathay General Hospital, Taipei 106, Taiwan

BMC Research Notes 2011, 4:394  doi:10.1186/1756-0500-4-394

Published: 10 October 2011

Abstract

Background

Circular Binary Segmentation (CBS) is a permutation-based algorithm for array Comparative Genomic Hybridization (aCGH) data analysis. CBS segments data accurately by detecting change-points with a maximal-t test, but evaluating the significance of those change-points by permutation imposes a heavy computational burden. A later implementation, hybrid CBS, added a hybrid method and early stopping rules to improve speed. However, a time analysis revealed that hybrid CBS still spends a major portion of its computation time on permutation. In addition, the hybrid method approximates only an upper or lower bound on the significance of a change-point, not the significance itself.
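
To illustrate why permutation dominates the run time, the following Python sketch computes a maximal-t statistic over all arcs of a circularised sequence and then estimates its significance by shuffling. It is a simplified illustration, not the CBS or hybrid CBS implementation: the function names `max_t` and `permutation_p_value` are hypothetical, the overall standard deviation stands in for the variance estimate, and no hybrid approximation or early stopping is applied.

```python
import random
from itertools import accumulate
from statistics import pstdev


def max_t(x):
    """Return (largest |t|, (i, j)) over all arcs x[i:j] versus the rest of x."""
    n = len(x)
    s = [0.0] + list(accumulate(x))   # s[k] = x[0] + ... + x[k-1]
    sd = pstdev(x) or 1.0             # assumption: overall SD as the scale
    best, best_arc = 0.0, (0, 0)
    for i in range(n - 1):
        for j in range(i + 1, n):
            k = j - i                 # arc length; the complement has n - k points
            inside = (s[j] - s[i]) / k
            outside = (s[n] - s[j] + s[i]) / (n - k)
            t = abs(inside - outside) / (sd * (1.0 / k + 1.0 / (n - k)) ** 0.5)
            if t > best:
                best, best_arc = t, (i, j)
    return best, best_arc


def permutation_p_value(x, n_perm=200, seed=0):
    """Estimate P(max-t >= observed) by shuffling the data n_perm times."""
    rng = random.Random(seed)
    observed, _ = max_t(x)
    shuffled = list(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_t(shuffled)[0] >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy profile: 50 probes with a raised block at positions 20..29.
    profile = [0.1, -0.1] * 10 + [1.6, 1.4] * 5 + [0.1, -0.1] * 10
    print(max_t(profile))
    print(permutation_p_value(profile))
```

Each permutation repeats the full quadratic scan over arcs, so the cost of the significance test grows linearly with the number of permutations, which is the burden the text describes.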

Results

We developed a novel model-based algorithm, extreme-value based CBS (eCBS), which limits permutations and provides robust results without loss of accuracy. Thousands of aCGH data sets were simulated in advance under the null hypothesis, based on a variety of non-normal assumptions, and the corresponding maximal-t distribution was modeled by the Generalized Extreme Value (GEV) distribution. The modeling results, which associate characteristics of aCGH data with the GEV parameters, constitute lookup tables (the eXtreme model). Using the eXtreme model, the significance of a change-point can be evaluated in constant time through a table lookup.
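
The table lookup idea can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the eCBS code: all function names are hypothetical, the table is keyed only by the number of probes (the text says only that it is indexed by "characteristics of aCGH data"), and the GEV parameters are estimated with Hosking's L-moment approximation, chosen here because it needs only the standard library; the abstract does not state which estimator was used.

```python
import math


def fit_gev_lmoments(samples):
    """Estimate GEV (mu, sigma, xi) from samples using L-moments."""
    x = sorted(samples)
    n = len(x)
    # Unbiased probability-weighted moments b0, b1, b2.
    b0 = sum(x) / n
    b1 = sum(j * x[j] for j in range(n)) / (n * (n - 1))
    b2 = sum(j * (j - 1) * x[j] for j in range(n)) / (n * (n - 1) * (n - 2))
    l1, l2, l3 = b0, 2 * b1 - b0, 6 * b2 - 6 * b1 + b0
    c = 2.0 / (3.0 + l3 / l2) - math.log(2) / math.log(3)
    k = 7.8590 * c + 2.9554 * c * c   # Hosking's shape, k = -xi
    if abs(k) < 1e-6:                 # Gumbel limit (xi = 0)
        sigma = l2 / math.log(2)
        return l1 - 0.5772156649 * sigma, sigma, 0.0
    g = math.gamma(1.0 + k)
    sigma = l2 * k / ((1.0 - 2.0 ** (-k)) * g)
    mu = l1 - sigma * (1.0 - g) / k
    return mu, sigma, -k


def gev_sf(x, mu, sigma, xi):
    """Upper-tail probability P(X > x) for a GEV(mu, sigma, xi) variable."""
    z = (x - mu) / sigma
    if abs(xi) < 1e-9:
        return -math.expm1(-math.exp(-z))
    t = 1.0 + xi * z
    if t <= 0.0:                      # x lies outside the support
        return 1.0 if xi > 0 else 0.0
    return -math.expm1(-(t ** (-1.0 / xi)))


def build_table(null_stats_by_length):
    """Map each probe count n to GEV parameters fitted on simulated null max-t values."""
    return {n: fit_gev_lmoments(stats) for n, stats in null_stats_by_length.items()}


def lookup_p_value(table, n, observed_max_t):
    """Constant-time significance: one dictionary lookup and one closed-form tail."""
    mu, sigma, xi = table[n]
    return gev_sf(observed_max_t, mu, sigma, xi)
```

Here `build_table` would be run once, offline, on maximal-t values simulated under the null hypothesis. At analysis time `lookup_p_value` replaces the permutation loop with a single evaluation of the fitted tail, so its cost does not depend on any permutation count.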

Conclusions

A novel algorithm, eCBS, was developed in this study. The current implementation of eCBS consistently outperforms hybrid CBS by 4× to 20× in computation time without loss of accuracy. Source code, supplementary materials, supplementary figures, and supplementary tables can be found at http://ntumaps.cgm.ntu.edu.tw/eCBSsupplementary.