Email updates

Keep up to date with the latest news and content from BMC Research Notes and BioMed Central.

Open Access Research article

Effect of positional dependence and alignment strategy on modeling transcription factor binding sites

Saad Quader and Chun-Hsi Huang*

Author Affiliations

Department of Computer Science & Engineering, University of Connecticut, 371 Fairfield Road, Storrs, Unit 2155, Connecticut 06269-2155, USA

For all author emails, please log on.

BMC Research Notes 2012, 5:340  doi:10.1186/1756-0500-5-340

Published: 2 July 2012

Abstract

Background

Many consensus-based and Position Weight Matrix-based methods for recognizing transcription factor binding sites (TFBS) are not well suited to the variability in the lengths of binding sites. Besides, many methods discard known binding sites while building the model. Moreover, the impact of Information Content (IC) and the positional dependence of nucleotides within an aligned set of TFBS has not been well researched for modeling variable-length binding sites. In this paper, we propose ML-Consensus (Mixed-Length Consensus): a consensus model for variable-length TFBS which does not exclude any reported binding sites.

Methods

We consider Pairwise Score (PS) as a measure of positional dependence of nucleotides within an alignment of TFBS. We investigate how the prediction accuracy of ML-Consensus is affected by the incorporation of IC and PS with a particular binding site alignment strategy. We perform cross-validations for datasets of six species from the TRANSFAC public database, and analyze the results using ROC curves and the Wilcoxon matched-pair signed-ranks test.

Results

We observe that the incorporation of IC and PS in ML-Consensus results in statistically significant improvement in the prediction accuracy of the model. Moreover, the existence of a core region among the known binding sites (of any length) is witnessed by the pairwise coexistence of nucleotides within the core length.

Conclusions

These observations suggest the possibility of an efficient multiple sequence alignment algorithm for aligning TFBS, accommodating known binding sites of any length, for optimal (or near-optimal) TFBS prediction. However, designing such an algorithm is a matter of further investigation.