Abstract
Background
Improving the accuracy and efficiency of motif recognition is an important computational challenge that has application to detecting transcription factor binding sites in genomic data. Closely related to motif recognition is the CONSENSUS STRING decision problem that asks, given a parameter d and a set of ℓ-length strings S = {s_{1}, ..., s_{n}}, whether there exists a consensus string that has Hamming distance at most d from any string in S. A set of strings S is pairwise bounded if the Hamming distance between any pair of strings in S is at most 2d. It is trivial to determine whether a set is pairwise bounded, and a set cannot have a consensus string unless it is pairwise bounded. We use CONSENSUS STRING to determine whether or not a pairwise bounded set has a consensus. Unfortunately, CONSENSUS STRING is NP-complete. The lack of an efficient method to solve the CONSENSUS STRING problem has caused it to become a computational bottleneck in MCLWMR, a motif recognition program capable of solving difficult motif recognition problem instances.
Results
We focus on the development of a method for solving CONSENSUS STRING quickly with a small probability of error. We apply this heuristic to develop a new motif recognition program, sMCLWMR, which has impressive accuracy and efficiency. We demonstrate the performance of sMCLWMR in detecting weak motifs in large data sets and in real genomic data sets, and compare the performance to other leading motif recognition programs. In our preliminary discussion of our CONSENSUS STRING algorithm we give insight into the issue of sampling pairwise bounded sets, and discuss its relevance to motif recognition.
Conclusion
Our novel heuristic yields a state-of-the-art program, sMCLWMR, that is capable of detecting weak motifs in data sets with a large number of strings. sMCLWMR is orders of magnitude faster than its predecessor MCLWMR and is capable of solving previously unsolved synthetic motif recognition problems. Lastly, sMCLWMR shows impressive accuracy in detecting transcription factor binding sites in the genomic data used in the assessment of Tompa et al.
Background
Given a number of DNA strings, motif recognition is the task of discovering similar substrings without prior knowledge of their consensus or their locations. The following is a combinatorial formulation of the (ℓ, d)-motif problem [1]: let S = {s_{1}, ..., s_{n}} be a set of m-length strings, and s* be the consensus string, a fixed and unknown string of length ℓ that is contained in each s_{i} as a substring but is corrupted with at most d substitutions (point mutations). The aim is to determine s* and the location of the motif instances in each string. The weak motif recognition problem is to find the motif instances when the number of degenerate positions d is large in relation to the motif length ℓ; well-known weak motif recognition problems exist when the parameters (ℓ, d) are equal to (11, 3), (15, 4), and (18, 6). This combinatorial problem has application to finding transcription factor binding sites in genomic data [2].
Motif recognition is NP-complete and therefore cannot be solved in polynomial time unless P = NP [3]. Nonetheless, there are numerous algorithms developed to solve specific instances of the problem, including PROJECTION [4], Winnower [1], pattern-driven approaches [5], MITRA [6], PMS1 [7], PMSprune [8], the Voting algorithm [9], MCLWMR [10], MEME [11], VAS [12], RISOTTO [13], Weeder [14] and several others. Li et al. proved the existence of a PTAS for an optimization version of the motif recognition problem, though the high degree in the polynomial complexity of the PTAS algorithm renders this result only of theoretical interest [15].
Closely related to motif recognition is the CONSENSUS STRING decision problem. A consensus string for a set S of strings has Hamming distance at most d from all strings in S. CONSENSUS STRING asks, given a parameter d and a set S = {s_{1}, ..., s_{n}} of n strings, each of length ℓ, whether there exists a consensus string for S. CONSENSUS STRING is NP-complete even when interest is limited to the binary alphabet [16].
For a given parameter d we say S is a motif set if there exists a consensus string s* at distance at most d from any string in S; we say a set S of strings is pairwise bounded if the distance between any pair of strings in S is at most 2d. Every motif set is pairwise bounded; if a pairwise bounded set is not a motif set we say it is a decoy set. For example, for d = 1 the set {000, 001, 010, 100} is a motif set because 000 is a consensus string for this set. In contrast, the set {000, 011, 101, 110} is a decoy set because it is pairwise bounded (since any two of the strings are at Hamming distance 2) but no consensus string exists.
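The two example sets above can be checked mechanically. The following is a minimal sketch, not part of sMCLWMR: it tests pairwise boundedness directly and searches for a consensus string by exhaustive enumeration, which is only feasible for very short strings. The function names (`hamming`, `is_pairwise_bounded`, `find_consensus`, `classify`) are illustrative choices.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_pairwise_bounded(strings, d):
    """True if every pair of strings is at Hamming distance at most 2d."""
    return all(hamming(a, b) <= 2 * d for a, b in combinations(strings, 2))


def find_consensus(strings, d, alphabet="01"):
    """Exhaustively search for a string within distance d of every input.

    The search is exponential in the string length, so it is only meant
    for tiny examples. Returns a consensus string, or None if none exists.
    """
    length = len(strings[0])
    for candidate in product(alphabet, repeat=length):
        c = "".join(candidate)
        if all(hamming(c, s) <= d for s in strings):
            return c
    return None


def classify(strings, d, alphabet="01"):
    """Label a set as a motif set, a decoy set, or not pairwise bounded."""
    if not is_pairwise_bounded(strings, d):
        return "not pairwise bounded"
    if find_consensus(strings, d, alphabet) is not None:
        return "motif set"
    return "decoy set"


print(classify(["000", "001", "010", "100"], 1))  # motif set
print(classify(["000", "011", "101", "110"], 1))  # decoy set
```

The exhaustive search is exactly the step that becomes infeasible for realistic ℓ, which is what motivates a cheaper indicator.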
The focus of this paper is the development and application of a heuristic for the CONSENSUS STRING decision problem (also known as the RADIUS DECISION problem [16]). We denote the Hamming distance between any pair of strings s_{i} and s_{j} as H(s_{i}, s_{j}). We define the weight of a set of strings S as the sum of the Hamming distances of each pair of strings in S (i.e. Σ_{1 ≤ i < j ≤ n} H(s_{i}, s_{j})). If the weight of a set, which can be calculated in polynomial time, can be used to indicate whether it is a motif set or a decoy set then CONSENSUS STRING can be solved extremely efficiently and accurately in practice: simply calculate the weight of the pairwise bounded set and decide whether the set has a consensus based on this value. For this heuristic to work we need to know how the respective weights of a random motif set and a random decoy set are distributed. Further, the distributions need to be adequately separated so that the weight of a set leaves little ambiguity as to whether the set is a motif set or a decoy set.
There exists an algorithm to sample from the set of all motif sets: simply choose any ℓ-length string as the consensus sequence and sample with replacement from the set of all strings that are at distance at most d from that sequence [10]. Unfortunately we do not know an analogous sampling algorithm, either exact or approximate, for decoy sets. If we could sample pairwise bounded sets uniformly then we could learn the probability distribution of the weight of a random decoy set.
We give a method to generate pairwise bounded sets uniformly, use this method to determine the probability distribution of the weight of a random decoy set, and show the existence of a separation between this distribution and the probability distribution of the weight of a random motif set. Thus, we solve CONSENSUS STRING instances extremely accurately and efficiently with the simple heuristic of using the weight as an indicator of whether a pairwise bounded set is a motif set or a decoy set. The separation of the distributions becomes more pronounced as the number of strings in the set (i.e. the parameter n) increases, so the accuracy of our method increases as the number of strings increases. We significantly extend our earlier motif recognition program, MCLWMR [10], by incorporating the heuristic for CONSENSUS STRING described in this paper. This new algorithm, referred to as sMCLWMR, detects motifs in data sets with a large number of strings (i.e. 30 or more strings), and finds regulatory strings in genomic data. sMCLWMR represents the input data as a weighted graph and uses graph clustering to narrow the search to smaller problems that can be solved with significantly less computation. An efficient refinement algorithm that distinguishes valid motif sets from decoy sets allows sMCLWMR to detect motifs in very large data sets in significantly less computational time than MCLWMR.
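As a small illustration of the weight defined above, the sketch below sums the Hamming distance over all unordered pairs of strings, which takes time quadratic in n and linear in ℓ. The function name `weight` and the two variable names are illustrative; the two inputs are the d = 1 example sets from the preceding paragraph.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def weight(strings):
    """Sum of Hamming distances over all unordered pairs of strings."""
    return sum(hamming(a, b) for a, b in combinations(strings, 2))


motif_example = ["000", "001", "010", "100"]
decoy_example = ["000", "011", "101", "110"]
print(weight(motif_example))  # 9
print(weight(decoy_example))  # 12
```

In this toy case the decoy set is the heavier of the two; whether such a gap holds in general is an empirical question about the two weight distributions.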
Methods
Sampling pairwise bounded sets
In this section we discuss uniform sampling, or generation, of pairwise bounded sets. A standard method used to generate a random motif set is to choose an ℓ-length string u.a.r. (uniformly at random) from all possible 4^{ℓ} strings to be the consensus string, and then form a motif set by selecting n strings at random with replacement from the set of all strings with Hamming distance at most d from this consensus string [4,10]. This does not sample motif sets uniformly, but rather samples a motif set with probability proportional to the number of distinct consensus strings it has; it thus corresponds to how synthetic problem data sets are constructed and how we expect meaningful motif sets to arise in nature. For example, synthetic problem instances are traditionally generated as follows: a random consensus string of length ℓ is chosen, n occurrences of the motif are generated by randomly mutating at most d positions, and each of the n motif instances is embedded at a random location into a different background string of length m. We note that other nonuniform distributions have also been used to generate motif sets [1].
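The standard motif-set generator described above can be sketched as follows. This is an illustrative sketch rather than the code used in the paper: the helper `sample_from_ball` and its way of drawing uniformly from a Hamming ball (pick the distance k with probability proportional to the number of strings at that distance, then pick k positions and replacement bases) are assumptions made here for concreteness.

```python
import random
from math import comb

BASES = "ACGT"


def random_string(length, rng):
    """A string of the given length chosen u.a.r. over the DNA alphabet."""
    return "".join(rng.choice(BASES) for _ in range(length))


def sample_from_ball(center, radius, rng):
    """Draw a string u.a.r. from all strings within `radius` of `center`.

    There are comb(L, k) * 3**k strings at distance exactly k, so k is
    drawn with probability proportional to that count; the k positions
    and their replacement bases are then chosen uniformly.
    """
    length = len(center)
    radius = min(radius, length)
    counts = [comb(length, k) * 3 ** k for k in range(radius + 1)]
    k = rng.choices(range(radius + 1), weights=counts)[0]
    chars = list(center)
    for pos in rng.sample(range(length), k):
        chars[pos] = rng.choice([b for b in BASES if b != chars[pos]])
    return "".join(chars)


def sample_motif_set(n, length, d, rng):
    """Random consensus plus n strings drawn with replacement around it."""
    consensus = random_string(length, rng)
    strings = [sample_from_ball(consensus, d, rng) for _ in range(n)]
    return consensus, strings


rng = random.Random(0)
consensus, strings = sample_motif_set(n=20, length=15, d=4, rng=rng)
print(consensus, len(strings))
```

Because every string is drawn around the same consensus, the result is a motif set by construction.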
When sampling uniformly from a poorly understood sample set, rejection sampling is a naïve but useful technique. If we can find a superset of the target set that is easy to sample from uniformly, we can sample from this superset and simply throw away (reject) any sampled element that is not in the target set. We show how rejection sampling can be applied to generate pairwise bounded sets uniformly.
Uniform sampling of pairwise bounded sets
To sample u.a.r. from all pairwise bounded sets using rejection sampling in the most naïve way, we would generate n random ℓ-length strings and accept the set if it is pairwise bounded, and reject and repeat otherwise (technically this samples uniformly from pairwise bounded sequences since the order of the strings matters in a sequence). However, since it is unlikely that such a randomly generated set would be pairwise bounded, this method is extremely inefficient. We introduce a heuristic to generate random sets that are more likely to be pairwise bounded, thus speeding up the rejection sampling process enough to be practical.
We generate the first string, s_{1}, u.a.r. from the set of all ℓ-length strings, then generate each of s_{2}, ..., s_{n} in turn u.a.r. from the set of all strings at distance at most 2d from s_{1}. This gives us a set of strings generated u.a.r. from the set of all strings that have s_{1} as the first string and each other string at distance at most 2d from s_{1}. If the set is pairwise bounded we keep it; if it is not we reject it and start over. The fact that this method generates pairwise bounded sequences u.a.r. can be verified by induction on n. The number of times a set of n strings is considered and rejected until a pairwise bounded set is generated follows a geometric distribution and therefore, the efficiency of this method is determined by the probability that a set is rejected. Though this method is fast enough to work in practice for values of n we are interested in, the expected runtime when generating a single pairwise bounded set grows exponentially with n.
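The rejection sampling procedure just described can be sketched as follows. The code is a hedged illustration, not the authors' implementation: the function names and the method of drawing uniformly from a Hamming ball are assumptions, and the demonstration uses a small n so that it finishes quickly.

```python
import random
from itertools import combinations
from math import comb

BASES = "ACGT"


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def sample_from_ball(center, radius, rng):
    """Draw a string u.a.r. from all strings within `radius` of `center`."""
    length = len(center)
    radius = min(radius, length)
    # comb(L, k) * 3**k strings lie at distance exactly k from the center.
    counts = [comb(length, k) * 3 ** k for k in range(radius + 1)]
    k = rng.choices(range(radius + 1), weights=counts)[0]
    chars = list(center)
    for pos in rng.sample(range(length), k):
        chars[pos] = rng.choice([b for b in BASES if b != chars[pos]])
    return "".join(chars)


def sample_pairwise_bounded(n, length, d, rng):
    """Return (strings, rejections) for one accepted pairwise bounded set.

    s_1 is uniform over all strings; s_2..s_n are uniform over the ball of
    radius 2d around s_1; the whole set is rejected unless every pair is
    within distance 2d.
    """
    rejections = 0
    while True:
        s1 = "".join(rng.choice(BASES) for _ in range(length))
        strings = [s1] + [sample_from_ball(s1, 2 * d, rng) for _ in range(n - 1)]
        if all(hamming(a, b) <= 2 * d for a, b in combinations(strings, 2)):
            return strings, rejections
        rejections += 1


rng = random.Random(1)
strings, rejections = sample_pairwise_bounded(n=4, length=15, d=4, rng=rng)
print(len(strings), "strings accepted after", rejections, "rejections")
```

Returning the rejection count alongside the set makes it straightforward to repeat the kind of efficiency measurement reported for this sampler.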
Proposition 1. The probability that a set generated using rejection sampling is pairwise bounded decreases at least exponentially fast as a function of n.
Proof. For 1 ≤ i ≤ n let S_{i} be the subset of S containing the first i randomly chosen strings, with S_{n} = S. Let A_{i} be the event that S_{i} is pairwise bounded. Any subset of a pairwise bounded set is pairwise bounded, so A_{i} implies A_{i-1} for 2 ≤ i ≤ n. Therefore by Bayes' law we have ℙ[A_{i}] = ℙ[A_{i} | A_{i-1}] ℙ[A_{i-1}]. To prove that ℙ[A_{n}] decays exponentially with n we need only show that ℙ[A_{i} | A_{i-1}] is nonincreasing in i, since it can easily be verified to be strictly less than 1 for i = 3. Let K_{i} be the set of strings s such that S_{i} ∪ {s} is pairwise bounded if and only if s ∈ K_{i}, noting that K_{i} = ∅ if S_{i} is not pairwise bounded. We have K_{j} ⊆ K_{i} for any 1 ≤ i < j ≤ n. Since, given S_{i-1}, the probability that S_{i} is pairwise bounded is |K_{i-1}| / B(2d), where B(2d) is the number of strings at distance at most 2d from s_{1}, the result holds. □
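To make the final step of the proof explicit, the chain of conditional probabilities can be unrolled as below. Here p is shorthand introduced only for this note; it denotes the conditional probability at i = 3, which the proof states is strictly less than 1. The bound uses only that the conditional probabilities are nonincreasing in i and that ℙ[A_2] ≤ 1.

```latex
\mathbb{P}[A_n]
  = \mathbb{P}[A_2] \prod_{i=3}^{n} \mathbb{P}[A_i \mid A_{i-1}]
  \le \prod_{i=3}^{n} \mathbb{P}[A_3 \mid A_2]
  = p^{\,n-2},
\qquad p = \mathbb{P}[A_3 \mid A_2] < 1 .
```

The acceptance probability is therefore bounded above by a geometric sequence in n, which is the claimed exponential decay.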
To empirically evaluate the efficiency of our rejection sampling method we determined the proportion of sets that will be rejected when generating a sample (of specified size) of pairwise bounded sets. We performed experiments with varying values of n, ℓ, and d, generated 10000 pairwise bounded sets in each experiment, and considered the average number of sets rejected before a pairwise bounded set was obtained. The default values for (n, ℓ, d) are (20, 15, 4).
The results of the empirical tests are shown in Figure 1. Each of the three plots shows how the average number of rejected sets changes when one of the three parameters is varied and the other two are fixed at their default values. The left plot shows what happens when d varies between 1 and 7. For values of d that are either greater than ⌊ℓ/2⌋ or equal to 0, any set we generate is pairwise bounded; hence, we did not plot data for d = 0 or d ≥ 8. The average number of rejected sets is largest when d is equal to 2 and decreases dramatically as d increases. This trend is expected since a large portion of non-pairwise-bounded sets would be rejected when d is moderately large. The middle plot shows what happens when ℓ is varied between 9 and 55. The number of rejected sets increases steadily when ℓ varies within the range [9,20], then plateaus when ℓ is above 20. It can easily be shown analytically that increasing ℓ above 2dn will have no effect; empirically, however, we see that the effect of ℓ is already minimal for values of ℓ greater than 20. The right plot shows the effect of varying n between 3 and 31. Noting that a logarithmic scale is used, the average number of rejected sets exhibits growth that is clearly exponential in n.
Figure 1. Efficiency of rejection sampling. Average number of rejections when generating a pairwise bounded set with our rejection sampling heuristic. Each plot shows the effect of varying one of the three parameters (n, ℓ, d). Data points are connected with cubic splines. Note the logarithmic scale used in the right plot.
A separation of weight distributions
One of the key motivations for the development of methods to generate pairwise bounded sets from an appropriate distribution is that it can be used to determine whether there is a separation between the probability distribution of the weight of a random valid motif set and that of a random decoy set. We use the sampling method just described to generate 1000 random motif sets and 1000 random decoy sets for varying values of (ℓ, d) and n. For each random motif and decoy set witnessed we calculated the weight of the set. Figure 2 depicts, for values considered for (ℓ, d) and n, the distribution of the weight of the 1000 random motif sets and that of the 1000 random decoy sets. The data illustrate an adequate separation between the distributions.
Figure 2. Weight distribution histograms. Histograms showing weight distributions for motif sets and decoy sets. Normal distributions fitted to the data are shown to indicate that the weight distributions are approximately normal.
As the value of n increases, the separation between the distributions becomes more pronounced since the probability distributions become more concentrated around their means and the means themselves diverge. Further, the dichotomy is again more evident when (ℓ, d) is increased from (15, 4) to (18, 6). When n is even moderately large we can use the weight to determine accurately whether the set is a motif set or a decoy set, and as n increases this method of using the weight as an indicator will likely increase in accuracy. Similar conclusions can be made when ℓ and d increase. These results suggest that the simple heuristic of using the weight to determine whether a pairwise bounded set is a valid motif set or a decoy set will enable computationally challenging instances of the CONSENSUS STRING problem (e.g. when n ≥ 20 or (ℓ, d) is equal to (18, 6)) to be solved efficiently with minimal probability of error.
These empirical trends illustrate the analytical results of Boucher et al. [10] that demonstrate that the distribution of the weight of a random motif set is tightly concentrated around its mean. The following theorem proves that the distribution of W_{m }is sharply concentrated around its mean; specifically it provides exponential tail bounds.
Theorem 1 (Strong concentration bound for motif sets [10]). Let W_{m }be the weight of a random motif set and μ_{m }be the expected value of W_{m}. Then for any λ > 0,
It is currently open to prove an analogous result to Theorem 1 for an arbitrary decoy set. This is a considerably more challenging problem due to the lack of a combinatorial characterization of a decoy set.
sMCLWMR: an efficient method to detect motifs in large data sets
In 2007, MCLWMR was developed specifically for the problem of detecting weak motifs in genomic data [10]. One of the main contributions of MCLWMR is the introduction of a novel weighted-graph model for motif recognition. Unfortunately, MCLWMR was unable to detect motifs in instances beyond ℓ = 18, d = 6, m = 1000, and n ≥ 20 [10]. Eskin and Pevzner reported similar results for various motif recognition programs [6], and Feng et al. showed limited accuracy for the (15, 4) problem with 20 strings of length 600 [17]. Specific motif recognition problems (that is, the problem for specific values of n, m, ℓ, and d) have remained intractable. For example, MCLWMR was unable to solve any instance of the (25, 8) motif recognition problem with n = 20.
MCLWMR uses graph clustering to determine pairwise bounded sets that might be valid motifs. The major impediment to the efficiency of MCLWMR was the exponential-time refinement algorithm used to determine which "candidate motif sets" (i.e. pairwise bounded sets) have a consensus string [10]; this step becomes a bottleneck for solving challenging weak motif instances, such as (18, 6), when the number of such candidate sets increases dramatically [4]. Boucher and Brown [18] give a probabilistic heuristic for solving the consensus string problem, which filters candidate sets based on a "majority vote" and has acceptable accuracy when n is sufficiently large (i.e. when n ≥ 20). We propose a probabilistic algorithm that eliminates the need for a strong bound on n; our novel algorithm uses a candidate set's weight to determine quickly and with a small probability of error whether the set is a decoy set or a motif set.
Overview of system
sMCLWMR considers a weighted graph representation of the data set in which each substring of length ℓ is represented by a vertex. The construction of the graph ensures that the vertices representing the motif instances are pairwise adjacent and form a clique of size n, though the converse need not hold. In this model, the problem of finding pairwise bounded sets in the data reduces to finding cliques of size n in the graph, which is constructed as follows.
1. The vertex set contains a vertex v_{i, j} representing the ℓ-length substring in string i starting at position j, for each i and j = 1, 2, ..., m - ℓ + 1. There are n(m - ℓ + 1) vertices.
2. Each pair of vertices v_{i, j} and v_{i', j'}, for i ≠ i', is joined by an edge if and only if the corresponding substrings are at Hamming distance at most 2d.
3. An edge between vertices having distance k has weight ℓ - k for d < k ≤ 2d, or 10(ℓ - k) for k ≤ d. This emphasizes substrings at small distances.
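The three construction rules above can be sketched directly. This is an illustrative, quadratic-time construction rather than the sMCLWMR implementation; vertices are represented as (string index, start position) pairs with 0-based positions, and the function name `build_graph` is an assumption.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def build_graph(strings, length, d):
    """Build the weighted substring graph.

    Returns (vertices, edges), where edges maps a pair of vertices to the
    edge weight. Vertices from the same input string are never joined.
    """
    vertices = [(i, j)
                for i, s in enumerate(strings)
                for j in range(len(s) - length + 1)]
    edges = {}
    for (i, j), (i2, j2) in combinations(vertices, 2):
        if i == i2:
            continue  # rule 2 only joins substrings of different strings
        k = hamming(strings[i][j:j + length], strings[i2][j2:j2 + length])
        if k <= d:
            edges[(i, j), (i2, j2)] = 10 * (length - k)  # close pair, boosted
        elif k <= 2 * d:
            edges[(i, j), (i2, j2)] = length - k
    return vertices, edges


vertices, edges = build_graph(["AAAT", "ACCA"], length=3, d=1)
print(len(vertices), "vertices,", len(edges), "edges")
```

A dictionary keyed by vertex pairs keeps the sketch short; a sparse adjacency structure would be the natural choice for inputs of realistic size.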
We chose to use the Markov cluster algorithm (MCL) [19] to cluster the graph due to its ability to handle large weighted graphs. We reduce the size of the instance being passed to MCL by considering the subgraphs G_{1}, G_{2}, ..., G_{m-ℓ+1}, where, for some arbitrary choice of reference string R, G_{j} is the subgraph induced by the closed neighborhood of the reference vertex v_{R, j}. This is more efficient than searching the whole graph at once. MCL then clusters each subgraph G_{j} to determine subgraphs that are highly interconnected (high edge weight within a cluster). A clique in G_{j} that represents a pairwise bounded set must have size n and have weight at least n(n - 1)(ℓ - 2d)/2, since each of its n(n - 1)/2 pairs of vertices must be joined by an edge of weight at least ℓ - 2d. We filter out the clusters produced by MCL that do not meet these criteria since they cannot contain sufficiently large cliques. MCLWMR uses a dynamic programming algorithm to determine which pairwise bounded sets (or cliques) represent valid motif sets; this computationally intensive step limits its ability to solve many motif recognition instances.
Figure 2 illustrates that both the weight of a random motif set and that of a random decoy set are approximately normally distributed, and shows a separation between these distributions. Using the rejection sampling method described earlier we calculate the mean and standard deviation of the weight of a random motif set and the weight of a random decoy set. We use N(μ, σ^{2}) to denote a normal distribution with mean μ and variance σ^{2}. Let random variables W_{m} and W_{d} denote the weight of a random motif set and the weight of a random decoy set, respectively. Let μ_{m} and σ_{m}^{2} respectively denote the mean and variance of the distribution of W_{m} and similarly, let μ_{d} and σ_{d}^{2} respectively denote the mean and variance of W_{d}. Assuming that W_{m} ~ N(μ_{m}, σ_{m}^{2}) and W_{d} ~ N(μ_{d}, σ_{d}^{2}), we can determine the values α_{m} and α_{d} such that:
If α_{m }<α_{d }then we can use the weight of a pairwise bounded set of strings to determine whether the set is a decoy or a motif as follows: calculate the weight w of the set and, if w ≤ α_{m }or w ≥ α_{d }then return that the set is a motif or a decoy, respectively; otherwise, use the dynamic programming algorithm to classify the set. Hence, if α_{m }<α_{d }then more than 99% of pairwise bounded sets will be classified correctly by considering the weight of the set. Typically the gap between α_{m }and α_{d }is large enough to guarantee that this rate is far higher than 99%. In theory it is possible that a set could be misclassified (e.g. if a motif set has weight greater than α_{d}) though in practice the probability of this happening is negligible and does not affect the performance of the algorithm.
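The decision rule above can be sketched in a few lines. Two points are assumptions of this sketch and are not taken from the text: the thresholds are computed as the 99% quantile of the fitted motif-weight distribution (for α_m) and the 1% quantile of the fitted decoy-weight distribution (for α_d), a reading chosen to be consistent with the 99% figure and with motif sets having the smaller weights; and the means and standard deviations in the usage lines are invented numbers, not values from Table 1.

```python
from statistics import NormalDist


def thresholds(mu_m, sigma_m, mu_d, sigma_d, level=0.99):
    """Assumed thresholds: P[W_m <= alpha_m] = level = P[W_d >= alpha_d]."""
    alpha_m = NormalDist(mu_m, sigma_m).inv_cdf(level)
    alpha_d = NormalDist(mu_d, sigma_d).inv_cdf(1 - level)
    return alpha_m, alpha_d


def classify_by_weight(w, alpha_m, alpha_d):
    """Return 'motif', 'decoy', or 'undecided' for a set of weight w.

    'undecided' means the weight is not informative and an exact method
    (such as the dynamic programming algorithm) should be used instead.
    """
    if alpha_m >= alpha_d:
        return "undecided"  # distributions overlap too much for the rule
    if w <= alpha_m:
        return "motif"
    if w >= alpha_d:
        return "decoy"
    return "undecided"


# Hypothetical distribution parameters, for illustration only.
alpha_m, alpha_d = thresholds(mu_m=100.0, sigma_m=5.0, mu_d=200.0, sigma_d=5.0)
for w in (105, 150, 195):
    print(w, classify_by_weight(w, alpha_m, alpha_d))
```

Keeping an explicit "undecided" outcome mirrors the fallback to the exact algorithm for weights that fall between the two thresholds.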
To increase the efficiency of sMCLWMR, we include a precalculated table storing μ_{m}, μ_{d}, and the corresponding variances for common values of ℓ, d, and n (for examples see Table 1). We varied n to be between 10 and 50, ℓ to be between 15 and 30, and d to be between ⌊ℓ/5⌋ and ⌊ℓ/2⌋. Values with weaker motifs or with small data sets (i.e. when n ≤ 10) are not considered since it was shown that MCLWMR performs efficiently for these instances [10].
Table 1. Weight distribution properties.
Results and discussion
Performance of sMCLWMR on synthetic data
We follow the experimental methods of Pevzner and Sze [1], and Buhler and Tompa [4] by considering the performance of sMCLWMR in comparison to other contemporary and well-known motif recognition programs on synthetic data. We fix n to be equal to 20, m to be 600, and consider varied values of ℓ and d. To produce random motif recognition instances, we generate a random motif consensus of length ℓ, then generate n occurrences of the motif, each generated from the consensus by randomly choosing d positions and for each of the d positions choosing a random replacement base from the four possible bases (A, C, G, T). We construct n background strings of length m and insert each generated motif occurrence at a random position in its string. For each of the (ℓ, d) combinations, 100 randomly generated sets of input strings (n = 20, m = 1000) were generated. The implementation of sMCLWMR is in C++.
We note that all experimental tests were performed on a Linux machine with a 64-bit 2600 MHz processor and 1 Gbyte of RAM running Ubuntu. We compared the performance of sMCLWMR with that of the following motif recognition programs: PROJECTION [4], MCLWMR [10], PMSprune [8], and Voting [9]. All programs were run on the same Linux machine with the same data sets. These motif recognition programs were chosen for their availability, performance, and widespread use; they are appropriate for comparison with sMCLWMR because of their previously described capability in solving weak motif instances and because they could be run on the described machine. The results of Voting, PMSprune, and PROJECTION are similar to the ones reported by Davila et al. [8] and by Chin and Leung [12], both of whose testing was completed on a machine with a slightly slower processor and the same core memory size.
We define the success rate of a given program using the performance coefficient used by Pevzner and Sze [1], Buhler and Tompa [4], and others [9,12]. Let K denote the set of tℓ base positions in the t occurrences of the planted motif, and let P denote the corresponding set of base positions in the t occurrences predicted by an algorithm. The algorithm's success rate is defined as |K ∩ P| / |K ∪ P|. Table 2 illustrates the comparison between the running time of sMCLWMR and that of the other programs. Our aim was to test the selected programs on their capability to solve challenging motif instances (i.e. when d is large with respect to ℓ). In Table 2, "-" implies that the program was not capable of solving the motif instance on the described machine in a reasonable amount of time, which we define to be at most 20 hours, or with reasonable accuracy, which we define to be at least 75%. Two significant trends are witnessed in the data: sMCLWMR is capable of solving very hard instances of motif recognition (i.e. when ℓ = 30 and d = 9) and gives a dramatic improvement over the existing programs for instances where ℓ ≥ 14 (for instances where ℓ ≤ 12 sMCLWMR had comparable or better performance to the other programs). We note that all programs except PROJECTION achieved a 100% success rate on all motif instances; in Table 2 we put the success rate of PROJECTION in brackets.
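The instance generation procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the generator used for the reported experiments: each occurrence overwrites ℓ consecutive characters of its background string (so every string keeps length m), a replacement base may coincide with the original base (so an occurrence has at most d substitutions), and all function names are hypothetical.

```python
import random

BASES = "ACGT"


def random_string(length, rng):
    """A string of the given length chosen u.a.r. over the DNA alphabet."""
    return "".join(rng.choice(BASES) for _ in range(length))


def planted_instance(n, m, length, d, rng):
    """Return (consensus, strings, positions) for one synthetic instance."""
    consensus = random_string(length, rng)
    strings, positions = [], []
    for _ in range(n):
        occurrence = list(consensus)
        # Choose d positions and redraw each from all four bases.
        for pos in rng.sample(range(length), d):
            occurrence[pos] = rng.choice(BASES)
        background = random_string(m, rng)
        start = rng.randrange(m - length + 1)
        planted = background[:start] + "".join(occurrence) + background[start + length:]
        strings.append(planted)
        positions.append(start)
    return consensus, strings, positions


rng = random.Random(2010)
consensus, strings, positions = planted_instance(n=20, m=600, length=15, d=4, rng=rng)
print(consensus, positions[:5])
```

Recording the planted positions alongside the strings is what later allows predictions to be scored against the known sites.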
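The performance coefficient defined above is the size of the intersection of the known and predicted base positions divided by the size of their union. A minimal sketch, assuming one site per string given by its 0-based start position (the function names are illustrative):

```python
def site_positions(starts, length):
    """Set of (string index, base position) pairs covered by the sites."""
    return {(i, start + offset)
            for i, start in enumerate(starts)
            for offset in range(length)}


def performance_coefficient(known_starts, predicted_starts, length):
    """Overlap between known and predicted sites, between 0 and 1."""
    known = site_positions(known_starts, length)
    predicted = site_positions(predicted_starts, length)
    return len(known & predicted) / len(known | predicted)


# One string, motif length 4: planted at 10, predicted at 12.
# The sites share 2 positions out of 6 covered in total.
print(performance_coefficient([10], [12], 4))
```

A prediction that is shifted by a few bases therefore still earns partial credit, while an exact match on every string scores 1.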
Table 2. Performance on synthetic data with varying (ℓ, d).
There exist real genomic data sets that contain a large number of sequences. For example, a data set, labeled as hm20, in the TRANSFAC database [20] has 34 input strings. Unfortunately, it is uncommon to test motif recognition programs with synthetic data sets with more than 20 input strings. For example, the following motif recognition algorithms were tested with data sets with at most 20 strings: PROJECTION [4], Winnower [1], MITRA [6], PMS1 [7], PMSprune [8], the Voting algorithm [9], MCLWMR [10], and VAS [12]. We aim to investigate the capability of sMCLWMR, as well as other motif recognition programs, in solving motif recognition instances with a large number of strings. The other programs tested include MCLWMR, Voting, and PMSprune. Table 3 shows that sMCLWMR was capable of solving instances with up to 40 strings. Again, as in Table 2, "-" implies that the program was not capable of solving the motif instance on the described machine in a reasonable amount of time, which we define to be at most 20 hours, or with reasonable accuracy, which we define to be at least 75%. The capability of sMCLWMR in solving motif recognition instances with a large number of strings can be explained by the fact that the runtime of the method used to solve CONSENSUS STRING scales slowly in n and therefore remains efficient even when n is large (e.g. n = 40).
Table 3. Performance on synthetic data with varying n.
Using sMCLWMR to find regulatory elements
An important biological challenge is to identify DNA binding sites of transcription factors. In this section, we demonstrate the use of sMCLWMR in discovering these DNA string "motifs" in data sets with a large number of DNA strings. Tompa et al. extensively assess 13 motif recognition tools [2] using test sets that make use of transcription factor binding sites. The binding sites were obtained from the TRANSFAC database [20] which contains only eukaryotic transcription factors. The TRANSFAC database is extremely comprehensive, containing data from a large variety of species, including yeast, mus, oryctolagus cuniculus, and homo sapiens [20]. For more details concerning the data set, including the selection process for transcription factors and binding sites from TRANSFAC, see Tompa et al. [2].
We ran sMCLWMR on a randomly selected set of transcription factors from those of Tompa et al. [2]. Each transcription factor gives rise to one set of strings. The number of strings varied from 8 (hm26) to 34 (hm20) and the string length (parameter m) varied from 700 bp to 2000 bp. Experimental results are shown in Table 4. sMCLWMR was capable of discovering motifs for these data sets, as well as many motifs not yet found by the motif recognition programs assessed by Tompa et al. [2]. The known binding sites shown in Table 4 are as given by the TRANSFAC database and Tompa et al. [2].
Table 4. Motif recognition on biological data.
Conclusion
In this paper we investigate the relationship between the weight of a decoy set and the weight of a motif set by means of random sampling. We discuss a rejection sampling strategy, and propose a means to make this uniform sampling method more efficient. Using our proposed sampling algorithm, we study the probability distributions of the respective weights of a random motif set and a random decoy set. We conclude that the weight of a pairwise bounded set can accurately predict whether the set is a valid motif set; we then use this heuristic to develop a program that efficiently detects motifs in large data sets. Our focus was to develop an efficient program that solves a combinatorial version of the motif recognition problem. A position weight matrix (PWM) is another commonly used representation of motifs in biological strings [21]. The application of the techniques described in this paper (graph clustering and statistical thresholds) to the PWM model of motif recognition warrants further investigation.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
Concept, implementation, and experiments: CB. Analysis and manuscript preparation: CB and JK.
Acknowledgements
This project was supported by the Natural Sciences and Engineering Research Council of Canada and the Walter C. Sumner Memorial Fellowship. The authors are grateful to Daniel G. Brown for his discussions and insights concerning the results presented in this paper, and Francis Y.L. Chin and Henry C.M. Leung for making their motif recognition program available to us. We are also grateful to the referees for their many helpful comments.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 1, 2010: Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/11?issue=S1.
References

1. Pevzner P, Sze S: Combinatorial approaches to finding subtle signals in DNA strings.
2. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, et al.: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 2005, 23:137-144.
3. Evans PA, Smith A, Wareham HT: On the complexity of finding common approximate substrings. Theor Comput Sci 2003, 306(1-3):407-430.
4. Buhler J, Tompa M: Finding motifs using random projections. J Comput Biol 2002.
5. Sze S, Lu S, Chen J: Integrating sample-driven and pattern-driven approaches in motif finding.
6. Eskin E, Pevzner P: Finding composite regulatory patterns in DNA strings. Bioinformatics 2002, 18(Suppl 1):S354-S363.
7. Rajasekaran S, Balla S, Huang CH: Exact algorithms for planted motif problems. J Comput Biol 2005, 12(8):1117-1128.
8. Davila J, Balla S, Rajasekaran S: Fast and practical algorithms for planted (l, d) motif search.
9. Chin FYL, Leung CM: Voting algorithms for discovering long motifs.
10. Boucher C, Church P, Brown D: A graph clustering approach to weak motif recognition.
11. Bailey TL, Elkan C: The value of prior knowledge in discovering motifs with MEME.
12. Chin FYL, Leung CM: An efficient algorithm for string motif discovery.
13. Pisanti N, Carvalho A, Marsan L, Sagot MF: RISOTTO: Fast extraction of motifs with mismatches. Proc LATIN 2006 2006, 757-768.
14. Pavesi G, Mereghetti P, Mauri G, Pesole G: Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res 2004, 32(Web Server issue):W199-W203.
15. Li M, Ma B, Wang L: Finding similar regions in many strings. J Comput Syst Sci 2002, 65:73-96.
17. Feng WS, Wang Z, Wang L: Identification of distinguishing motifs.
18. Boucher C, Brown D: Detecting motifs in a large data set: applying probabilistic insights to motif finding.
19. van Dongen S: Graph clustering by flow simulation. PhD thesis. University of Utrecht; 2000.
20. Wingender E, Dietze P, Karas H, Knüppel R: TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res 1996, 24:238-241.
21. Ben-Gal I, Shani A, Gohr A, Grau A, Grau J, Arviv S, Shmilovici A, Posch S, Grosse I: Identification of transcription factor binding sites with variable-order Bayesian networks. Bioinformatics 2005, 21(11):2657-2666.
22. Frith MC, Hansen U, Spouge JL, Weng Z: Finding functional sequence elements by multiple local alignment. Nucleic Acids Res 2004, 32:189-200.