BMC Bioinformatics

Open Access Research article

An effective approach for identification of in vivo protein-DNA binding sites from paired-end ChIP-Seq data

Congmao Wang1, Jie Xu1, Dasheng Zhang1, Zoe A Wilson2 and Dabing Zhang1,3*

Author Affiliations

1 School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China

2 School of Biosciences, University of Nottingham, Sutton Bonington Campus, Loughborough, Leicestershire, LE12 5RD, UK

3 Bio-X Research Center, Key Laboratory of Genetics & Development and Neuropsychiatric Diseases, Ministry of Education, Shanghai Jiao Tong University, Shanghai 200240, China

BMC Bioinformatics 2010, 11:81 doi:10.1186/1471-2105-11-81

Published: 9 February 2010

Abstract

Background

ChIP-Seq, which combines chromatin immunoprecipitation (ChIP) with high-throughput massively parallel sequencing, is increasingly used to identify protein-DNA interactions in vivo across the genome. However, making the most of such data requires new algorithms that can accurately predict protein-DNA binding sites.

Results

Here, we present SIPeS (Site Identification from Paired-end Sequencing), a novel algorithm for precise identification of binding sites from short reads generated by paired-end Solexa ChIP-Seq technology. We applied it to ChIP-Seq data for the Arabidopsis basic helix-loop-helix transcription factor ABORTED MICROSPORES (AMS), which is expressed within the anther during pollen development. The results show that SIPeS has better resolution for binding site identification than two existing ChIP-Seq peak detection algorithms, Cisgenome and MACS.
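To illustrate why paired-end reads help with resolution, the following is a minimal sketch of a fragment pileup: each read pair gives both ends of a sequenced fragment, so coverage can be counted over the whole fragment rather than over an estimated extension of a single read. This is not the SIPeS implementation. The function names, the half-open `(start, end)` coordinate convention, and the choice of the plateau midpoint as the summit are all assumptions made for illustration.

```python
from typing import List, Tuple


def fragment_pileup(fragments: List[Tuple[int, int]], length: int) -> List[int]:
    """Per-base coverage over a region of `length` bases.

    Each fragment is a hypothetical (start, end) pair, end exclusive,
    taken from the outer coordinates of one aligned read pair.
    """
    # Difference array: +1 where a fragment starts, -1 where it ends.
    diff = [0] * (length + 1)
    for start, end in fragments:
        s = max(0, start)
        e = min(length, end)
        if s < e:
            diff[s] += 1
            diff[e] -= 1

    # A running sum of the difference array gives the coverage.
    coverage = []
    running = 0
    for delta in diff[:length]:
        running += delta
        coverage.append(running)
    return coverage


def summit(coverage: List[int]) -> int:
    """Position of maximum coverage (middle of the plateau if tied)."""
    peak = max(coverage)
    positions = [i for i, c in enumerate(coverage) if c == peak]
    return positions[len(positions) // 2]


if __name__ == "__main__":
    frags = [(0, 5), (3, 8), (4, 10)]
    cov = fragment_pileup(frags, 10)
    print(cov, summit(cov))
```

In this toy example the three fragments overlap only at position 4, so that position is reported as the candidate binding site.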

Conclusions

Compared with Cisgenome and MACS, SIPeS shows better resolution for binding site discovery. Moreover, SIPeS uses the fragment length obtained from the paired-end reads to calculate the mappable genome length accurately. It also employs dynamic baselines to discriminate closely adjacent binding sites, which is of particular value when working with high-density genomes.
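The abstract does not say how the dynamic baselines are computed, so the sketch below shows only the general idea under a simple assumption: a baseline is raised step by step over an enriched region until the region breaks into separate segments, each of which can then be treated as its own candidate site. The function names, the integer `step`, and the stopping rule are hypothetical and are not taken from SIPeS.

```python
from typing import List, Tuple


def segments_above(coverage: List[int], baseline: int) -> List[Tuple[int, int]]:
    """Half-open (start, end) runs where coverage is strictly above baseline."""
    segments = []
    start = None
    for i, c in enumerate(coverage):
        if c > baseline and start is None:
            start = i
        elif c <= baseline and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(coverage)))
    return segments


def split_region(coverage: List[int], baseline: int, step: int = 1):
    """Raise the baseline until the enriched region splits.

    Returns (level, segments). If raising the baseline never produces
    more segments than the starting baseline does, the starting
    baseline and its segments are returned unchanged.
    """
    original = segments_above(coverage, baseline)
    top = max(coverage, default=0)
    level = baseline
    while level < top:
        segments = segments_above(coverage, level)
        if len(segments) > len(original):
            return level, segments
        level += step
    return baseline, original


if __name__ == "__main__":
    # Two summits (9 and 10) separated by a shallow valley (3).
    cov = [0, 2, 5, 9, 5, 3, 6, 10, 6, 2, 0]
    print(segments_above(cov, 1))
    print(split_region(cov, 1))
```

With a fixed baseline of 1 the two summits are merged into one region; raising the baseline to 3 separates them. A real implementation would also need to guard against splitting on noise, which this toy version does not attempt.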