HPeak: an HMM-based algorithm for defining read-enriched regions in ChIP-Seq data
1 Center for Statistical Genetics, Department of Biostatistics, School of Public Health, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109-2029, USA
2 Center for Computational Medicine and Bioinformatics, University of Michigan Medical School, 100 Washtenaw Avenue, Ann Arbor, MI 48109-2218, USA
3 Michigan Center for Translational Pathology, University of Michigan Medical School, 1400 E. Medical Center Drive, Ann Arbor, MI 48109-0940, USA
4 Department of Pathology, University of Michigan Medical School, 1301 Catherine, Ann Arbor, Michigan 48109-0602, USA
5 Division of Hematology/Oncology, Robert H. Lurie Comprehensive Cancer Center, Northwestern University, 676 North St. Clair, Suite 1200, Chicago, IL 60611, USA
6 Comprehensive Cancer Center, University of Michigan Medical School, 1500 E. Medical Center Drive, Ann Arbor, MI 48109-0944, USA
7 Department of Urology, University of Michigan Medical School, 1500 E. Medical Center Drive, Ann Arbor, MI 48109-0330, USA
8 Howard Hughes Medical Institute, 4000 Jones Bridge Road, Chevy Chase, MD 20815-6789, USA
BMC Bioinformatics 2010, 11:369. doi:10.1186/1471-2105-11-369. Published: 2 July 2010
Protein-DNA interaction constitutes a basic mechanism for the genetic regulation of target gene expression. Deciphering this mechanism has been a daunting task because protein-bound DNA is difficult to characterize on a large scale. A powerful technique has recently emerged that couples chromatin immunoprecipitation (ChIP) with next-generation sequencing (ChIP-Seq). This technique provides a direct survey of the cistrome of transcription factors and other chromatin-associated proteins. To realize the full potential of this technique, increasingly sophisticated statistical algorithms have been developed to analyze the massive amount of data it generates.
Here we introduce HPeak, a Hidden Markov model (HMM)-based Peak-finding algorithm for analyzing ChIP-Seq data to identify protein-interacting genomic regions. In contrast to the majority of available ChIP-Seq analysis software packages, HPeak is a model-based approach allowing for rigorous statistical inference. This approach enables HPeak to accurately infer genomic regions enriched with sequence reads by assuming realistic probability distributions, in conjunction with a novel weighting scheme on the sequencing read coverage.
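To make the idea of HMM-based peak finding concrete, the following is a minimal sketch and not the HPeak implementation. It assumes a two-state model (background and enriched) over read counts in fixed-width genomic bins, plain Poisson emissions with hand-picked means, a single self-transition probability, and Viterbi decoding. The abstract does not specify HPeak's emission distributions or its weighting scheme for read coverage, so none of that is modeled here; the function names `viterbi_two_state` and `enriched_regions` and all parameter values are hypothetical.

```python
import math


def poisson_logpmf(k, lam):
    """Log-probability of observing count k under a Poisson with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def viterbi_two_state(counts, lam_bg=1.0, lam_enriched=8.0, p_stay=0.95):
    """Most likely state path for binned read counts.

    States: 0 = background, 1 = enriched. All parameters are illustrative
    assumptions, not values taken from HPeak.
    """
    if not counts:
        return []
    lams = (lam_bg, lam_enriched)
    log_stay = math.log(p_stay)
    log_switch = math.log(1.0 - p_stay)

    # Uniform initial state distribution.
    score = [math.log(0.5) + poisson_logpmf(counts[0], lam) for lam in lams]
    back = []  # back[t][s] = best predecessor of state s at bin t + 1

    for c in counts[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            stay = score[s] + log_stay
            switch = score[1 - s] + log_switch
            if stay >= switch:
                best, prev = stay, s
            else:
                best, prev = switch, 1 - s
            new_score.append(best + poisson_logpmf(c, lams[s]))
            pointers.append(prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def enriched_regions(path):
    """Collapse a state path into half-open (start_bin, end_bin) intervals."""
    regions, start = [], None
    for i, s in enumerate(path):
        if s == 1 and start is None:
            start = i
        elif s == 0 and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(path)))
    return regions


if __name__ == "__main__":
    counts = [0, 1, 0, 1, 0, 9, 12, 10, 8, 11, 0, 1, 0, 0, 1]
    print(enriched_regions(viterbi_two_state(counts)))
```

In this sketch the transition probability discourages isolated single-bin calls, so a region is reported only when the counts in a run of bins jointly favor the enriched state. A model-based tool would estimate the emission and transition parameters from the data instead of fixing them by hand.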
Using biologically relevant data collections, we found that the ChIP-enriched sequences identified by HPeak showed a higher prevalence of the expected transcription factor binding motifs, relative to control sequences, than those identified by other currently available ChIP-Seq analysis approaches. Additionally, in comparison to the ChIP-chip assay, ChIP-Seq provides higher resolution along with improved sensitivity and specificity of binding site detection. The additional file and the HPeak program are freely available at http://www.sph.umich.edu/csg/qin/HPeak.