BMC Bioinformatics

Open Access Research article

Determining significance of pairwise co-occurrences of events in bursty sequences

Niina Haiminen1*, Heikki Mannila1,2* and Evimaria Terzi3,4

Author Affiliations

1 HIIT, Department of Computer Science, P.O. Box 68, FI-00014 University of Helsinki, Finland

2 HIIT, Laboratory of Computer and Information Science, Helsinki University of Technology, FI-02015 TKK, Finland

3 IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA

4 This work was mostly done while the author was at HIIT, University of Helsinki, Finland

BMC Bioinformatics 2008, 9:336 doi:10.1186/1471-2105-9-336

Published: 8 August 2008

Abstract

Background

Event sequences in which different types of events often occur close together arise, for example, when studying potential transcription factor binding sites (TFBSs, the events) of certain transcription factors (TFs, the event types) in a DNA sequence. These events tend to occur in bursts: some genomic regions contain more genes and therefore potentially more binding sites, while in other, possibly very long, regions hardly any events occur. In addition, some types of events may occur in the sequence more often than others.

Tendencies of co-occurrence of binding sites of two or more TFs are interesting, as they may imply a co-operative role between the TFs in regulatory processes. A numerical value summarizing the tendency for co-occurrence between two TFs can be computed in a number of ways. However, the significance of such values should be tested against a relevant null model that takes the global sequence structure into account.

Results

We extend existing techniques for determining the significance of co-occurrence patterns between a pair of event types under different null models. These models range from very simple ones to more complex models that take the burstiness of sequences into account. We evaluate the models and techniques on synthetic event sequences and on real data consisting of potential transcription factor binding sites.

Conclusion

We show that simple null models are poorly suited to bursty data and yield many false positives. More sophisticated models give better results in our experiments. We also demonstrate the effect of the window size, i.e., the maximum co-occurrence distance, on the significance results.
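
To make the ideas of a co-occurrence score, a window size, and a null model concrete, the following is a minimal Python sketch. It is not the scoring function or any of the null models evaluated in the article; it is an illustration under stated assumptions. The score (the number of type-A events with at least one type-B event within distance w) and the null model (keep all event positions fixed and shuffle the type labels, so that the bursty position structure is preserved) are choices made here for illustration, and the names `cooccurrence_count` and `label_shuffle_pvalue` are hypothetical.

```python
import random
from bisect import bisect_left, bisect_right


def cooccurrence_count(pos_a, pos_b, w):
    """Count type-A events that have at least one type-B event within distance w.

    This score is an illustrative assumption, not the article's definition.
    """
    sorted_b = sorted(pos_b)
    count = 0
    for p in pos_a:
        # Indices of the B positions that fall inside [p - w, p + w].
        lo = bisect_left(sorted_b, p - w)
        hi = bisect_right(sorted_b, p + w)
        if hi > lo:
            count += 1
    return count


def label_shuffle_pvalue(events, type_a, type_b, w, n_samples=1000, seed=0):
    """Empirical p-value for the observed score under a label-shuffling null.

    events: list of (position, event_type) pairs.
    The null model keeps every position fixed and permutes the type labels,
    so bursts of events stay where they are (an assumed null model).
    """
    rng = random.Random(seed)
    positions = [p for p, _ in events]
    labels = [t for _, t in events]

    def score(lbls):
        a = [p for p, t in zip(positions, lbls) if t == type_a]
        b = [p for p, t in zip(positions, lbls) if t == type_b]
        return cooccurrence_count(a, b, w)

    observed = score(labels)
    at_least = 0
    for _ in range(n_samples):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if score(shuffled) >= observed:
            at_least += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (at_least + 1) / (n_samples + 1)


if __name__ == "__main__":
    # Toy data: two A/B pairs close together, plus distant events of type C.
    events = [(10, "A"), (12, "B"), (50, "A"), (52, "B"), (1000, "C"), (2000, "C")]
    for w in (5, 100):
        obs, p = label_shuffle_pvalue(events, "A", "B", w, n_samples=500)
        print(f"w={w}: observed={obs}, empirical p={p:.3f}")
```

Running the sketch with several values of w shows how the choice of window changes both the observed score and its estimated significance. A null model that instead placed events uniformly at random along the sequence would ignore burstiness, which is the kind of simple model the conclusion above warns against.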