Research article (Open Access)

Systematic identification of conserved motif modules in the human genome

Xiaohui Cai (1), Lin Hou (2, 3), Naifang Su (2), Haiyan Hu (4)*, Minghua Deng (2)* and Xiaoman Li (5)*

Author Affiliations

1 Center for Research in Biological Systems, University of California, San Diego, La Jolla, CA, 92093, USA

2 School of Mathematical Sciences and Center for Theoretical Biology, Peking University, Beijing, 100871, China

3 State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, Beijing, 102206, China

4 School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL, 32816, USA

5 Burnett School of Biomedical Science, University of Central Florida, Orlando, FL, 32816, USA

BMC Genomics 2010, 11:567  doi:10.1186/1471-2164-11-567

Published: 14 October 2010

Abstract

Background

The identification of motif modules, groups of multiple motifs that frequently occur together in DNA sequences, is one of the most important tasks in annotating the human genome. Current approaches to identifying motif modules are often restricted to searches within promoter regions or rely on multiple genome alignments. However, promoter regions account for only a limited number of the locations where transcription factor binding sites can occur, and multiple genome alignments often fail to align binding sites with their true counterparts because these sites are short and degenerate.

Results

To identify motif modules systematically, we developed a computational method that covers the entire non-coding regions around human genes and does not rely on multiple genome alignments. First, we selected orthologous DNA blocks approximately 1 kilobase in length based on discontiguous sequence similarity. Next, we scanned the conserved segments in these blocks using known motifs from the TRANSFAC database. Finally, we applied a frequent pattern mining technique to identify motif modules within these blocks. In total, with a false discovery rate cutoff of 0.05, we predicted 3,161,839 motif modules, 90.8% of which are supported by various forms of functional evidence. When compared with data from 14 ChIP-seq experiments, our method predicted, on average, 69.6% of the ChIP-seq peaks that contain transcription factor binding sites (TFBSs) of multiple transcription factors (TFs). Our findings also show that many motif modules have distance and order preferences among their motifs, which further supports the functionality of these predictions.
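The frequent pattern mining step can be illustrated with a minimal sketch. It assumes each conserved block has already been reduced to the set of motifs found in it, and it counts motif combinations by brute-force enumeration rather than by whatever mining algorithm the authors used. The function name, the `min_support` and `max_size` parameters, and the toy motif labels are hypothetical, and the sketch omits the false discovery rate filtering mentioned above.

```python
from collections import Counter
from itertools import combinations


def frequent_motif_modules(blocks, min_support, max_size=3):
    """Return motif combinations (2..max_size motifs) found in >= min_support blocks.

    blocks: iterable of motif-name lists, one list per conserved block
            (hypothetical input format, not the paper's data layout).
    """
    counts = Counter()
    for motifs in blocks:
        # Deduplicate and sort so each combination has one canonical tuple.
        unique = sorted(set(motifs))
        for size in range(2, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


# Toy input: the motif labels are placeholders for illustration only.
blocks = [
    ["SP1", "NFKB", "AP1"],
    ["SP1", "NFKB"],
    ["SP1", "AP1", "NFKB"],
    ["AP1", "CREB"],
]

modules = frequent_motif_modules(blocks, min_support=3)
print(modules)
```

Exhaustive enumeration is only practical for small motif sets per block; at genome scale a dedicated frequent itemset algorithm that prunes infrequent candidates early would be needed.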

Conclusions

Our work provides a large-scale prediction of motif modules in mammals, which will facilitate a systematic understanding of gene regulation.