Department of Cell & Systems Biology, University of Toronto, 25 Willcocks Street, Toronto, ON, M5S 3B2, Canada

Abstract

Background

Detection of false-positive motifs is one of the main causes of low performance in

Results

Using large-deviations theory, we derive a remarkably simple relationship that describes the dependence of false positives on dataset size for the one-occurrence-per-sequence motif-finding problem. As expected, we predict that false positives can be reduced by decreasing the sequence length or by adding more sequences to the dataset. Interestingly, we find that the false-positive strength depends more strongly on the number of sequences in the dataset than it does on the sequence length, but that the dependence on the number of sequences diminishes as sequences are added, until adding more sequences no longer reduces the false-positive rate significantly. We test our theoretical predictions by applying four popular motif-finding algorithms that solve the one-occurrence-per-sequence problem (MEME, the Gibbs Sampler, Weeder, and GIMSAN) to simulated data that contain no motifs. We find that the dependence of the false positives detected by these programs on the motif-finding parameters is similar to that predicted by our formula.

Conclusions

We quantify the relationship between the sequence search space and motif-finding false-positives. Based on the simple formula we derive, we provide a number of intuitive rules of thumb that may be used to enhance motif-finding results in practice. Our results provide a theoretical advance in an important problem in computational biology.

Background

Because binding of sequence specific transcription factors to their recognition sites in non-coding DNA is an important step in the control of gene expression, the development of computational methods to identify transcription factor binding motifs in non-coding DNA has received much attention in computational biology

An even more challenging computational problem is the

One explanation for these observations could be that the low information content of DNA binding sites places limits on this problem as well - an extension of the Futility Theorem

Here we argue that ‘false positive motifs’, i.e., patterns similar to typical biological motifs, may be likely to arise due to the statistical nature of large sequence data sets. In other words, when the dataset is large enough, motifs with strength similar to real transcription factor binding motifs begin to occur by chance. Consistent with this idea, it is frequently observed that DNA motif-finders identify seemingly strong candidate motifs, even when randomly chosen sequences are provided as the input. This issue has been previously recognized

The prevalence of such false positive motifs in DNA motif-finding has led to substantial research to assess the statistical significance of motifs. It is important to distinguish three distinct types of research in this area. The first aims to calculate p-values for matches to a given motif (e.g.,

The second and third types of research are closely related, and were both treated in the seminal work of Hertz & Stormo

Hertz & Stormo

In practice, significance of motifs identified through motif-finding is often obtained through simulations (e.g.

Here, we obtain a remarkably simple analytical relationship between the size of the sequence search space and the strength of the false-positive motifs (we provide a definition for the strength of a motif below). In particular, we use Sanov’s theorem

Since we have considered the underlying statistics of the one-occurrence-per-sequence motif-finding problem, our results should apply to any motif-finding method that attempts to solve this problem. We confirmed this with software packages that implement different optimization approaches, MEME

Results

A bound on the p-value of a motif

We first consider the problem of assigning a p-value to a motif (or ungapped multiple alignment). The patterns in DNA sequence families (called motifs) can be represented by position weight matrices (PWMs), in which each column specifies the distribution of the DNA letters

where _{jk} is the relative frequency of base _{k} is the background distribution of base
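As a concrete illustration of motif strength, the following minimal sketch computes the information content of an ungapped alignment. It assumes the standard relative-entropy form (summed over columns, in bits) and a uniform background unless one is supplied; the function name `information_content_bits` and the toy sites are hypothetical, and no pseudocounts are applied.

```python
import math
from collections import Counter


def information_content_bits(sites, background=None):
    """Relative entropy of an ungapped alignment, summed over columns (bits).

    Assumption: uniform background over A, C, G, T unless one is given.
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}
    n = len(sites)
    total = 0.0
    # zip(*sites) iterates over the columns of the alignment
    for column in zip(*sites):
        for base, count in Counter(column).items():
            freq = count / n  # relative frequency of this base in the column
            total += freq * math.log2(freq / background[base])
    return total


# Toy alignment: three perfectly conserved columns and one mixed column
toy_sites = ["ACGT", "ACGA", "ACGT", "ACGC"]
print(information_content_bits(toy_sites))
```

Each perfectly conserved column contributes 2 bits under a uniform background, so a motif of a given width can be at most twice its width in bits.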

**DNA motif finding problem parameters.** In this example,

Under the null hypothesis (of randomly generated sequences) a PWM with

where

**Appendix. Proof of the main theorem.** (PDF 346 kb) (PDF 306 kb)

**Figure S1**. The comparison between bound on the p-value (Eq. 2) and the p-value computed by an FFT-based method. Figure S2. Theoretical bound on sequence length compared with MEME results. Figure S3. Theoretical bound on sequence length compared with GIMSAN results. (PDF 225 kb)

Theoretical bounds on false positives in de novo motif finding

We now turn to our main focus, which is the problem of false positives in

Our main theoretical results are as follows. If the sequence length (

Here |

Furthermore, when one or more motifs are expected to occur by chance with strength

Thus, our theory predicts that when false positives occur, their strength will depend differently on each of the motif finding parameters

To obtain these results, we have followed Hertz & Stormo, and assumed that the ideal motif-finder has tested all (^{n} possible motifs. Please see Appendix A (
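The counting argument can be illustrated with a deliberately crude sketch. It is not Eq. 3 or Eq. 4 of this paper (which are not reproduced here) and is looser than the bound derived in the Appendix. It assumes a single strand, so that an ideal motif-finder tests (L − w + 1)^n alignments of n sequences of length L at width w, and it uses the textbook method-of-types form of Sanov's inequality, under which a fixed alignment of random sequences reaches a per-column-summed relative entropy of at least I nats with probability at most (n + 1)^(w·|A|)·exp(−n·I). Setting the resulting bound on the expected number of chance alignments to one gives a threshold strength. The function name and parameter names are hypothetical.

```python
import math


def crude_false_positive_threshold_bits(n_seqs, seq_len, width, alphabet_size=4):
    """Strength (bits) above which fewer than one chance motif is expected.

    Assumptions (not taken from the paper's equations):
      * (seq_len - width + 1) ** n_seqs candidate alignments (single strand)
      * Sanov / method-of-types bound for each fixed alignment
    """
    if seq_len < width:
        raise ValueError("sequence length must be at least the motif width")
    log_alignments = n_seqs * math.log(seq_len - width + 1)
    log_type_count = width * alphabet_size * math.log(n_seqs + 1)
    threshold_nats = (log_alignments + log_type_count) / n_seqs
    return threshold_nats / math.log(2)  # convert nats to bits


for n in (10, 30, 100):
    print(n, round(crude_false_positive_threshold_bits(n, seq_len=500, width=10), 2))
```

Even in this loose form, the threshold falls as sequences are added and levels off near log2(L − w + 1), while it grows only logarithmically with sequence length, which is the qualitative behaviour described in the text.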

False positives are predicted to arise in realistic motif-finding scenarios

We next sought to test whether the typical dataset sizes used for DNA motif-finding are likely to produce false positives according to the formula above. Figure

**Theoretical bound on sequence length compared with results from MEME.** Theoretical bound on sequence length,

We note that the bound on false positives (predicted by Eq. 3) depends more strongly on

**The relationship between false-positive information content and the number of sequences.** The figure shows the theoretical upper bound on the information content threshold,

Finally, the upper bound on false-positive strength threshold,

**The relationship between false-positive information content and the motif width.** The figure shows the theoretical upper bound on the information content threshold,

MEME, the Gibbs Sampler, GIMSAN, and Weeder performance is qualitatively consistent with the theoretical expectations

To confirm our theoretical results, we conducted a series of experiments with four popular motif-finding programs: MEME

We first performed extensive simulations with the MEME software because it allows the user to specify the parameters of the motif-finding problem, such as the width of the motif and the one-occurrence-per-sequence assumption. This allows us to directly compare our theoretical predictions of the dependence of false positives on the motif finding parameters to the observed false positives (Eq. 4). The results from MEME qualitatively follow the theoretical prediction (Figures

Since our theory is based only on the statistics of random sequences, it should be applicable to any motif finder that solves the one-occurrence-per-sequence motif finding problem, regardless of the algorithm used for optimization. To test this, we compared the strength of each false positive motif discovered by MEME and the Gibbs Sampler to the bound predicted by Eq. 4. For both MEME and the Gibbs Sampler, we found similar agreement between the observed false positives and the theoretical bound (R^{2} > 0.85, Figure

**Comparison of theoretical bound and observed false-positive motif strengths.** The strengths of observed false positive motifs identified by **a**) MEME, **b**) the Gibbs Sampler, **c**) GIMSAN and **d**) Weeder show reasonable accordance with our theoretical bound. Each cross represents one false positive motif, while the dashed line represents ‘

We also tested GIMSAN because of its unique approach for computing p-values based on the estimation of the null distribution for motifs. We asked GIMSAN to find motifs with widths (^{2} = 0.83, Figure

We note that most

Because Weeder does not allow the user to specify the width of the motif or the number of motif instances that each sequence will contain, we simply ran it repeatedly on random sequence sets of various sizes and identified false positive motifs (see Methods for more details). To compare the strength of the false positive motifs to the predicted bound on strength of these motifs based on our theoretical results, we defined ^{2} = 0.60). That the Weeder results show such good agreement with our predictions is somewhat surprising, as Weeder violates the assumptions we made in deriving Eq. 2. This suggests that our theoretical results may be quite robust to the assumptions made in the motif finding procedure (see Discussion).

For all four motif-finders, the false positives identified tend to be weaker than the theory predicts (Figure

Discussion

We used large-deviations theory to approximate the relationship between false positives and the parameters of the one-occurrence-per-sequence de novo DNA motif-finding problem. A similar approach has been previously proposed to quantify the so-called twilight zone

We note that the situation we considered is where each position in the DNA sequence is considered to be drawn from a background distribution

Simple rules of thumb for DNA motif finding

In experimental design, the goal is generally to make the motifs that arise by chance as weak as possible, so that real motifs stand out from them. The theoretical predictions provide intuition about how to adjust motif-finding parameters to reduce the strengths of motifs that are due to chance (using Eq. 4 or using the curves in Figures

· As intuitively expected, it is generally preferable to use shorter sequences (when biologically plausible) to avoid false positives.

· Adding more sequences to the dataset reduces the false-positive rate considerably (e.g. using 30 sequences compared to 10 reduces the false-positive motif strengths by more than 6 bits (~25%) for

· The dependency of false-positives (the strength of false-positive motifs) on

· For a given information content, the detection of motifs with smaller width is less prone to false-positives. Therefore, to avoid false positives, it is generally preferred to choose the smallest possible width that adequately summarizes the biological motif.

Examples of applications

In using the theoretical results in Eq. 3 or the graphs in Figure

It can be seen from Figure

**Examples of applications of the theoretical results.** Two real motifs of width

Comparison of false positives from different motif finders

To test whether our results were applicable beyond the one-occurrence-per-sequence setting, in addition to MEME and the Gibbs Sampler, we tested Weeder, a non-probabilistic motif-finder that implements a consensus-based search. We found that the theoretical relationship held quite well for the false positives produced by Weeder. This suggests either that the simple formula we obtained will be quite generally applicable, or that heuristic post-processing steps in Weeder (implemented by the so-called “advisor” program) to reduce the false positives (by removing the highest-scoring motifs that do not satisfy a redundancy criterion, see

Regardless of their generality, our theoretical results quantify the limit to how well we can expect even the ideal motif-finder to perform. This will be useful to future benchmarking studies, so they can take into account whether the ‘real’ motif in test cases is strong enough to be distinguished from false positives that spontaneously arise.

Conclusions

We have derived a remarkably simple formula to describe the relationship between false-positive strength and dataset size in the one-occurrence-per-sequence DNA motif-finding problem, and confirmed it using simulations. We conclude that false positives in

Methods

Simulations

In each experiment, we generated a set of

We presented each dataset as input to each program. For each detected motif, we computed the information content or divergence,
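The kind of experiment described here can be sketched in a few lines. The code below is only an illustration under stated assumptions: it draws sequences from a uniform background, and it uses a simple greedy coordinate search as a stand-in for a motif-finder (it is not MEME, the Gibbs Sampler, GIMSAN, or Weeder). All function names, the dataset size, and the random seed are hypothetical.

```python
import math
import random
from collections import Counter


def random_sequences(n_seqs, seq_len, rng):
    """Motif-free dataset: every base drawn uniformly (an assumption)."""
    return ["".join(rng.choice("ACGT") for _ in range(seq_len))
            for _ in range(n_seqs)]


def alignment_ic_bits(sites):
    """Information content of an alignment versus a uniform background."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        for count in Counter(column).values():
            freq = count / n
            total += freq * math.log2(freq / 0.25)
    return total


def greedy_chance_motif(seqs, width, rng, sweeps=5):
    """One site per sequence; move one site at a time if it raises the score."""
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]

    def score(positions):
        return alignment_ic_bits([s[p:p + width] for s, p in zip(seqs, positions)])

    best = score(starts)
    for _ in range(sweeps):
        for i, seq in enumerate(seqs):
            for pos in range(len(seq) - width + 1):
                trial = starts[:i] + [pos] + starts[i + 1:]
                trial_score = score(trial)
                if trial_score > best:
                    starts, best = trial, trial_score
    return starts, best


rng = random.Random(0)
seqs = random_sequences(n_seqs=10, seq_len=100, rng=rng)
starts, ic = greedy_chance_motif(seqs, width=6, rng=rng)
print(f"strength of the motif found in random data: {ic:.2f} bits")
```

Because the input contains no planted motif, any pattern this search reports is a false positive by construction; a real motif-finder explores the same search space far more thoroughly.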

Particular notes for each software are as follows:

MEME: We ran MEME with the OOPS (one occurrence per sequence) model, using the parameter (−m oops), and restricted MEME to report only one motif (the most significant) with widths

Gibbs Sampler: We used the “site sampler” model, which restricts the software to include in the PWM only one occurrence of the motif in each sequence, with widths

GIMSAN: We used the OOPS model and considered motifs with widths

Weeder: We ran Weeder on the random datasets using the “large” parameter. Because Weeder does not allow the user to specify the width of the motif (

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

AMM and AZ designed the study. AZ performed all research. AMM supervised the research. AMM and AZ wrote the paper. All authors read and approved the final manuscript.

Acknowledgements

The authors acknowledge Alex Nguyen Ba for helpful discussions, Dr. Christian Seis for mathematical advice, and the Associate Editor and an anonymous reviewer for numerous helpful suggestions. This research was supported by Canadian Institutes of Health Research grant #202372 and an infrastructure grant from the Canada Foundation for Innovation to AMM.