Abstract
Background
Finding over or underrepresented motifs in biological sequences is now a common task in genomics. Thanks to pvalue calculation for motif counts, exceptional motifs are identified and represent candidate functional motifs. The present work addresses the related question of comparing the exceptionality of one motif in two different sequences. Just comparing the motif count pvalues in each sequence is indeed not sufficient to decide if this motif is significantly more exceptional in one sequence compared to the other one. A statistical test is required.
Results
We develop and analyze two statistical tests, an exact binomial one and an asymptotic likelihood ratio test, to decide whether the exceptionality of a given motif is equivalent or significantly different in two sequences of interest. For that purpose, motif occurrences are modeled by Poisson processes, with a special care for overlapping motifs. Both tests can take the sequence compositions into account. As an illustration, we compare the octamer exceptionalities in the Escherichia coli K12 backbone versus variable strainspecific loops.
Conclusion
The exact binomial test is particularly adapted for small counts. For large counts, we advise to use the likelihood ratio test which is asymptotic but strongly correlated with the exact binomial test and very simple to use.
Background
Detecting motifs with a significantly unexpected frequency in DNA sequences has become a very common task in genome analysis. It is generally addressed to propose candidate functional motifs based on their statistical properties [13]. Lots of statistical methods have been developed to that purpose (see the recent surveys by [4] or [5] and references therein) and satisfactory solutions exist now to find exceptional motifs thanks to pvalue calculations.
More recently, a new related question has arisen in the literature concerning the comparison of motif exceptionalities in two sequences. One wants for instance to compare particular sets of genes [6], upstream regions of CDSs versus whole chromosome [7], structural domains [8], CDSs versus intergenic regions, conserved regions versus strainspecific regions of bacterial genomes [9], or chromosomes from the same species [10]. Chromosomes from different species can also be compared from a comparative genomics point of view. In all these works, one would like to know if a given motif is significantly more exceptional in one sequence compared to another one. This criterion is usually used to identify motifs which are specific from some regions or expected to be more frequent in some particular parts of the genome. Transcription factor binding sites, for instance, are expected to be more frequent in upstream regions than along the whole genome.
Surprisingly, no rigorous statistical method has been proposed yet to decide if a given motif, exact or not, is significantly more exceptional in one sequence compared to a second one. Of course, two pvalues can be calculated separately on each sequence to know if the motif is exceptional in these sequences but the difficult point is how to compare these two pvalues from a statistical point of view. It is indeed not sufficient to make the difference or the ratio to know if the two pvalues are significantly different; One needs a statistical test.
In this paper, we propose two statistical tests to compare the motif count exceptionalities in two independent sequences. In the Results Section, we first present the underlying model for motif occurrences and the null hypothesis to test, namely the motif is similarly exceptional in both sequences. Then we derive an exact binomial test and an asymptotic likelihood ratio test adapted for frequent motifs. Usage conditions and power of both tests are described in the Discussion Section, together with a more refined model for occurrences of overlapping words and the associated tests. An illustration of the method is finally given; We compare the octamer exceptionalities in two sets of regions (backbone/loops) from the Escherichia coli K12 leading strands. These two sets correspond to the mosaic structure of E. coli's genome when comparing the two strains K12 and O157:H7: the backbone represents the common regions whereas the loops are specific to the K12 strain. As a toy example all along this paper, we will treat in detail the case of the palindromic octamer cagcgctg which occurs respectively 30 times in the loops (758434 bps long) and 113 times in the backbone (3 882 513 bps long).
Results
Poisson model
In sequence i, the motif count N_{i }is supposed to have a Poisson distribution with mean (and variance) λ_{i}. This distribution has been shown to fit correctly theoretical (in Markovian sequences, for example) as well as observed count distributions of nonoverlapping words [11]; A nonoverlapping word is a word such that two occurrences of itself can not overlap in a sequence.
The mean count λ_{i }in sequence i must account for three parameters: (i) the length ℓ_{i }of the sequence, (ii) the composition of the sequence, (iii) the possible exceptionality of the motif in the sequence.
Expected intensity
The composition of the sequence can be accounted for via the probability μ_{i }for the motif to occur at any position in the sequence under a simple model. The most popular models are Markov chain models which can fit the frequencies in mono, di, trinucleotides, etc. Indeed, the Markov chain model of order m (denoted by Mm) takes the (m + 1)mer composition into account. Under such models, the occurrence probability μ_{i }of a hletter motif w = w_{1 }w_{2 }... w_{h }on the {a, c, g, t} alphabet can be expressed in terms of counts of its subwords of length m and m + 1 [5]. For instance, here are the expression of μ_{i }in models M0, M1 and M(h  2) which fit respectively the composition in bases, in dinucleotides and in oligonucleotides of length h  1:
where N_{i }(·) denotes the count in sequence i.
If one does not want to account for the sequence composition (this case will be referred to as model M00), then μ_{i }simply depends on the motif, hence μ_{1 }= μ_{2 }= (1/4)^{h}.
The choice of the Markov chain model depends on the sequence composition one wants to fit. For instance, model M2 is often used for coding DNA sequences to take the codon bias into account. Higher the model order, better the fit, but usually the model order is bounded either by h  2 or because the sequence is too small (the number of parameters to be estimated increases exponentially with the order).
Table 1 gives the expected counts ℓ_{i }μ_{i }for the motif cagcgctg in the E. coli loops/backbone sequences. Since N_{1 }= 30 and N_{2 }= 113, we see that this motif is highly overrepresented in both sequences under models M00, M0 and M1. However, under the richest possible model (M6), it is overrepresented in sequence 1 (loops) but underrepresented in sequence 2 (backbone).
Table 1. Expected count for cagcgctg in the loops (1) and in the backbone (2) of E. coli leading strands under different models.
Exceptionality coefficient
When the motif is not exceptional with respect to the considered model, the mean count λ_{i }is simply ℓ_{i }μ_{i}. For exceptional motifs, i.e. motifs with an observed count N_{i }far from its expectation ℓ_{i }μ_{i}, under a given model, the mean count λ_{i }should reflect this exceptionality.
We therefore introduce an exceptionality coefficient k_{i }which allows λ_{i }to be greater (or smaller) than the expected value:
λ_{i }: = k_{i }ℓ_{i }μ_{i}.
In the following, parameters ℓ_{i }and μ_{i }will be supposed to be known a priori: they can be considered as two correction terms. The inference will only be made on k_{i}.
Hypothesis testing
Comparing the (potential) exceptionality of a motif in two sequences is equivalent to test the null hypothesis H_{0 }= {k_{1 }= k_{2}}.
We emphasize that the respective values of k_{1 }and k_{2 }can be larger than one (unexpectedly frequent motif), smaller than one (unexpectedly rare motif) or close to one (motif with expected count). These values do not matter: our only concern is to know if they are significantly different or not.
Exact binomial test
We first propose an exact test based on a general property of the Poisson distribution. If N_{1 }and N_{2 }are two independent Poisson counts with respective means λ_{1 }and λ_{2}, the distribution of N_{1 }given their sum N_{+ }: = N_{1 }+ N_{2 }is binomial [12]: N_{1 }~ (N_{+}, π) with
Under H_{0}, we have π = π_{0 }with
because k_{1 }= k_{2}. In absence of correction (M00 model) for the sequence composition (i.e. μ_{1 }= μ_{2}), we have π_{0 }= ℓ_{1}/(ℓ_{1 }+ ℓ_{2}). If furthermore the two sequences have the same length, we get π_{0 }= 1/2.
Moreover, the proportion π and then the expectation of N_{1}, increases as the ratio k_{1}/k_{2 }increases. Therefore, the pvalue for the onesided alternative H_{1 }= {k_{1 }> k_{2}} is p_{B }= Pr { (n_{+}, π_{0}) ≥ n_{1}}, i.e.
where n_{+ }and n_{1 }are the observed values of N_{+ }and N_{1}.
Table 2 gives the probability π_{0 }and the pvalue p_{B }for the motif cagcgctg in E. coli. At level 5%, the null hypothesis is accepted under models M00 and M6 meaning that the motif is similarly exceptional in both sequences with respect to their length and/or 7mer composition. However, {k_{1 }= k_{2}} is rejected at level 5% against {k_{1 }> k_{2}} under models M0 and M1; since cagcgctg is overrepresented in both sequences, it means that it is significantly more exceptionally overrepresented in sequence 1 (loops) with respect to the base and/or dinucleotide compositions of both sequences.
Table 2. Probability π_{0 }and pvalue p_{B }under different models for cagcgctg in the E. coli loops/backbone comparison.
Likelihood ratio test (LRT)
Another test statistic based on the comparison of the likelihood of the data under the H_{0 }and the alternative hypothesis H_{1 }= {k_{1 }≠ k_{2}} can be derived. This statistic is known as the Likelihood Ratio Test (see [13], vol. IV). In our model (see the Methods Section), it is defined as
where π_{0 }is defined in (1). Under the null hypothesis, its asymptotic distribution is a chisquare distribution with one degree of freedom.
This test is twosided, because, under H_{1}, parameters k_{1 }and k_{2 }are estimated independently (in particular, without the constraint k_{1 }> k_{2}). The exact distribution of LRT could be calculated via permutation techniques but the computation time would be tremendeous for large counts. We will then calculate the following asymptotic pvalue:
where n_{2 }is the observed value of N_{2 }and χ^{2 }~ χ^{2 }(1).
Table 3 gives the LRT statistic and the associated pvalue for the motif cagcgctg in E. coli. Remember that the LRT is twosided, so p_{L }have to be divided by two when compared to the onesided binomial pvalue p_{B}. We see that the significances obtained with the LRT are different from the ones obtained with the exact binomial test, but the qualitative conclusions are the same.
Table 3. LRT statistic and associated pvalue p_{L }under different models for cagcgctg in the E. coli loops/backbone comparison.
Chisquare test
Another standard asymptotic test is the chisquare test where the counts N_{i }are compared to their expected values _{i }under H_{0 }given the total count N_{+}:
where _{1 }= π_{0 }N_{+ }and _{2 }= (1  π_{0})N_{+}. Under the null hypothesis, X^{2 }has also an asymptotic chisquare distribution with one degree of freedom. It is also an intrinsically twosided test. Further analyzes (including simulations) not presented here (see [14]) show that this test performs very similarly to the LRT in every situations. Note that the chisquare test is the same as the score test [13].
Discussion
LRT distribution
The chisquare distribution of the LRT statistic is only asymptotic, so the actual level may be different from the nominal one (typically α = 5%). To measure this difference, we have calculated this actual level for different values of π_{0 }and N_{+}. Since LRT is a function of N_{1}, the actual level can be derived from the exact distribution of N_{1 }given N_{+ }which is binomial (see Results Section).
Figure 1 compares both levels (actual and nominal). Since the counts are discrete, the actual level can never be exactly α leading to oscillations in the plot. We see that the nominal level is only reached with N_{+ }≃ 1000 for π_{0 }= 0.5 and even later for π_{0 }= 0.95 (or π_{0 }= 0.05). It means that the chisquare approximation of the LRT statistics is only valid for motifs with many total occurrences.
Figure 1. Actual level (log scale) of the LRT as a function of N_{+ }(log scale) for a nominal level α = 0.1%, 1%, 5% or 10% and probability π_{0 }= 0.5 (left) and 0.95 (right). (Since the LRT test is twosided, the right plot also holds for π_{0 }= 0.05).
Regarding the motif cagcgctg, because π_{0 }is about 15% (cf. Table 2), the picture is close to the right plot of Figure 1; In fact, with a total count of 143, the actual level is respectively 0.095%, 1.1%, 5.1% and 12.5% for a nominal level α equal to 0.1%, 1%, 5% and 10%.
LRT as a contrast measure
The LRT statistic can still be used as a contrast measure, i.e. a measure of the difference, between the two exceptionalities. For large values of N_{+ }it is much faster and easier to compute than the binomial pvalue. We will see in the illustration below that the two quantities are strongly correlated.
Decidability limits for the binomial test
Because the binomial test is exact, the actual and nominal levels are equal. The significance can then always be determined. It would be maximal when N_{1 }= N_{+ }(i.e. N_{2 }= 0) and the corresponding pvalue p_{B }would be equal to . Therefore, if this minimal pvalue is greater than the desired level α (typically 5%), no significance conclusion can be made. This happens when α, i.e. when N_{+ }≥ ln (α)/ln(π_{0}).
Figure 2 gives this critical value of N_{+ }for various values of π_{0 }and α. We see, for instance, that for π_{0 }= 0.7 and N_{+ }= 10, one may get significant results at a level greater than 5% but not at a level smaller than 1%.
Figure 2. Minimal count N_{+ }to get a significant result with the binomial test as a function of the probability π_{0}. Curves correspond to different levels α = 0.1%, 1%, 5% or 10% (from top to bottom).
Power
An important property for a statistical test is its ability to detect departure from the null hypothesis. This ability is measured by the power of the test which is the probability to exceed the significance threshold (defined under H_{0}) when the true parameter satisfies H_{1}. In our case, the parameter of interest is
which is equal to π_{0 }when k_{1 }= k_{2}. So the departure from H_{0 }will be measured by the ratio k_{1}/k_{2 }when it differs from 1.
Exact binomial
Figure 3 presents the power of the exact binomial test when k_{1}/k_{2 }increases. As expected, the power increases with N_{+}. Moreover, it decreases when π_{0 }increases i.e. when the expected ratio ℓ_{1 }μ_{1}/(ℓ_{2 }μ_{2}) increases. It means that, when the motif is already expected to be much more frequent in sequence 1 than in sequence 2, it is more difficult to detect that its exceptionality in the first sequence is also higher.
Figure 3. Power of the exact binomial test with level α = 5% as a function of k_{1}/k_{2 }(xaxis) for different values of π_{0}. Curves correspond to different values of the total count N_{+ }= 5 (dashed black), 10 (dashed red), 20 (dashed blue), 50 (dashed green), 100 (solid black), 500 (solid red) and 1000 (solid blue). Missing curves correspond to the values of N_{+ }for which no significant results at level α = 5% can be obtained (cf. the Discussion Section).
The motif cagcgctg occurs N_{+ }= 143 times in the whole genome. In the different models considered in Table 2, probability π_{0 }is between 11.6% and 16.4%. The power of the binomial test in this case can therefore be read in Figure 3, in the two top plots between the black and red solid lines. We see that a ratio k_{1}/k_{2 }= 2 can be detected with probability greater than 90%, while a ratio of 1.5 will be detected with a bit more than 50% probability.
LRT
The same analysis can be made for the LRT tests. However, this only makes sense for sufficiently large N_{+}, to guaranty the validity of the chisquare distribution.
Case of overlapping words
Compound Poisson model
The distribution of overlapping word occurrences is better modeled by a compound Poisson process (see [15]) in the following way:
• The word occurs in clumps distributed according to a Poisson process. The number of clumps C_{i }in sequence i is hence a random Poisson variable with mean denoted by _{i}.
• The size V_{ic }of the cth clump (in sequence i) is random with geometric distribution:
The clump sizes are supposed to be independent. Parameter a_{i }is the overlapping probability of the motif and can be calculated under various Markovian models (see [5]).
In this setting, the count N_{i }is hence the sum of the sizes of C_{i }clumps and has the PolyaAeppli (or geometric Poisson) distribution (see [12]). We have (see [5]) _{i }= (1  a_{i}) λ_{i}. In the case of a nonoverlapping word, we have C_{i }= N_{i}, a_{i }= 0 and λ_{i }= λ_{i}. For overlapping words, the mean clump size is equal to 1/(1  a_{i}) and increases with a_{i}.
Tests
An overlapping word can occur with an exceptional frequency (i) because of an exceptional number of clumps or (ii) because of exceptional sizes of clumps. Then comparing the exceptionalities of an overlapping word in two sequences leads to compare the number of clumps C_{1 }with C_{2}, and/or the sizes V_{1c}'s with V_{2c}'s.
Comparison of the number of clumps
In this compound Poisson model, the number of clumps in each sequence is Poisson distributed. The comparison of the counts C_{1 }and C_{2 }is then exactly equivalent to the comparison of the counts N_{1 }and N_{2 }studied in the Results Section, replacing λ_{i }by _{i }and μ_{i }by _{i }:= (1  a_{i}) μ_{i}.
Exact test for the overlapping probability under M00
The question is now to test the null hypothesis H_{0 }= {a_{1 }= a_{2}}. This comparison is made conditionally to the observed counts N_{1 }and N_{2}. It only makes sense if the motif occurs at least once in each sequence, i.e. if N_{1}, N_{2}, C_{1 }and C_{2 }are all larger than (or equal to) 1. In this case, the first occurrence necessarily corresponds to the first clump and the C_{i } 1 last clumps have to be chosen among the other N_{i } 1 motif occurrences. Since a motif occurrence (except the first one) corresponds to a clump occurrence with probability 1  a_{i}, the number of clumps (except the first one) has a binomial distribution:
C_{i } 1 ~ (N_{i } 1, 1  a_{i}) (2)
which means that the expected number of clumps decreases when the overlapping probability increases.
Following the same strategy as for the nonoverlapping case, we base our test on the distribution of C_{1 }given the total clump count C_{+ }= C_{1 }+ C_{2}. Under H_{0}, (C_{1 } 1) has an hypergeometric distribution (N_{+ } 2, N_{1 } 1, C_{+ } 2) (see [12], Eq. (3.23)):
The overlapping probability a_{1 }is then significantly greater than a_{2 }if the probability Pr{C_{1 }≤ c_{1}N_{1}, N_{2}, C_{+}} is smaller than a given level α.
Exact test in the general case
The previous test does not account for the composition of the sequences. The overlapping probabilities a_{1 }and a_{2 }can be expected to be different, according to some null model. In this case, the true overlapping probability in sequence i is b_{i }= h_{i }a_{i}, where h_{i }is an exceptionality coefficient (analogous to k_{i }for the mean count). The problem is then to test H_{0 }= {h_{1 }= h_{2}}. Such a test is proposed in Appendix: it involves the generalized negative hypergeometric distribution.
Asymptotic tests
As for the counts N and C, asymptotic tests such as likelihood ratio, chisquare or score tests can be derived to compare exceptionalities in terms of overlaps. These tests are not presented here to avoid further statistical developments but also because the small overlapping probabilities generally observed make them rarely relevant.
Illustration
Materials
Comparing complete genomes of strains of single bacterial species allows to determine highly conserved regions (socalled backbone) and numerous strainspecific DNA segments (socalled loops) for each strain. These mosaic structures help to understand the evolution of bacterial genomes. Indeed, the backbone probably corresponds to the common ancestral strain and is under vertical pressure whereas the loops may be associated with mobile elements or strainspecific pathogenicity. Such backbone/loops segmentation has been systematically performed [9] and store in the public MOSAIC database [16]. We have extracted from this database the E. coli K12 specific loops (sequence 1) and the backbone (sequence 2) obtained from the pairwise alignment of the complete genomes of E. coli K12 laboratory strain and the enterohemorrhagic E. coli O157:H7 strain. As an illustration, we have compared the exceptionalities of all the 65536 octamers in the backbone versus in the loops. Such comparison will point out octamers which do not have the same constraint, with respect to their frequency, on the loops versus on the backbone.
Exact binomial test
Figure 4 presents the significance of the binomial test for all octamers in the backbone/loops comparison. The limits between the different significance levels are clear under M00 because the probability π_{0 }is the same for all octamers, while they are fuzzy under M1 because π_{0 }depends on the octamer composition. In this case, same counts (N_{1}, N_{2}) may result in different p_{B }values. The distribution of the pvalue p_{B }is summarized in Table 4. The 10 motifs with smallest pvalues, i.e. with an exceptionality coefficient significantly higher in the loops than in the backbone, are listed in the top of Table 5. Multiple testing problems arise when we compare the exceptionalities of the 65 536 octamers simultaneously. Table 6 gives the number of significant octamers and the corresponding threshold when adjusting for a False Discovery Rate (FDR, [17]) of 1%. For example, under model M1 only 154 octamers are significantly more exceptional in the loops. These octamers have all a pvalue p_{B }smaller than 2.2 10^{5}.
Figure 4. Counts N_{2 }(xaxis: backbone) and N_{1 }(yaxis: loops) for all the octamers under the M00 (left) and M1 (right) models. The color indicates the significance of the binomial test in the M00 model: blue = 'p_{B }> 0.01', green = 'p_{B }< 0.01', yellow = 'p_{B }< 0.001', red = 'p_{B }< 0.0001'. The solid black line on the left indicates the expected ratio N_{1}/N_{2 }= π_{0}/(1  π_{0}).
Table 4. Number of significantly unbalanced octamers under different models and for different thresholds.
Table 5. Top: 10 motifs with smallest pvalue p_{B }(k_{loops }> k_{backbone}) for model M00, M0, M1 and M6. * indicates overlapping words. Bottom: 10 motifs with smallest pvalue
(k_{backbone }> k_{loops}).Table 6. Top: numbers of octamers significantly more exceptional in the loops when adjusting for a False Discovery Rate of 1% and associated thresholds for the pvalue p_{B }for different models. Bottom: idem for octamers significantly more exceptional in the backbone.
Symmetrically, to find the motifs with an exceptionality coefficient significantly higher in the backbone than in the loops, we have to test H_{0 }versus = {k_{2 }> k_{1}} using the pvalue defined as = Pr{ (n_{+}, π_{0}) ≤ n_{1}}. The 10 most significant motifs for this test are given at the bottom of Table 5. When adjusting for a False Discovery Rate of 1%, only 14 octamers under model M1 are significantly more exceptional in the backbone than in the loops. These octamers have all a pvalue smaller than 1.8 10^{6}. Note that under model M6, no octamer is significant after multiple testing adjustment.
According to the p_{B }list, the motif cagcgctg has rank 1 115 among 65 536 under the M1 model. Note that the well known Chi motif (gctggtgg) which is the most overrepresented octamer in E. coli genome has a value of 5.1 10^{5 }(rank 1 281) under the same model; It means that k_{backbone }is significantly higher than k_{loops }but due to multiple testing Chi is not among the significant octamers.
LRT versus binomial
We now compare the results provided by the two tests: binomial and LRT. Because the former is onesided and the latter is twosided, we use a signed version LRT^{s }of the LRT statistic which is equal to LRT when N_{1 }is greater than expected (N_{1 }≥ π_{0 }N_{+}) and to – LRT otherwise (N_{1 }<π_{0 }N_{+}). To make the graph more readable, we also transform the pvalue p_{B }into a Gaussian score S_{B }∈ ℝ:
S_{B }= Φ^{1 }(1  p_{B})
where Φ is the cumulative distribution function of the standard Gaussian distribution. High positive values of S_{B }correspond to motifs with an exceptionality coefficient in sequence 1 significantly higher than in sequence 2, while high negative values of S_{B }correspond to motifs having an exceptionality coefficient in sequence 1 significantly lower than in sequence 2.
We see in Figure 5 that the two statistics give very similar results for all the octamers in the backbone/loops comparison. Table 7 gives the Spearman and Kendall correlation coefficients between the two statistics for different models. Recall that Spearman's coefficient is the correlation between the ranks, while Kendall's one is the proportion of concordant pairs between the two rankings. This confirms that the LRT statistics is a reliable exceptionality comparison score, although the associated pvalue is questionable for small counts.
Figure 5. Signed LRT statistic LRT^{s }(yaxis) versus transformed binomial pvalue Φ^{1 }(1  p_{B}) (xaxis) under model M1 for all octamers in the E. coli backbone/loops comparison.
Table 7. Spearman and Kendall correlation coefficients between LRT^{s }and S_{B }for different models.
Note that the naive comparison between the two pvalues simply associated with the exceptionality of each motif in each sequence does not provide the same sets of significant octamers (see Figure 6). Such pvalues have been calculated using the Poisson approximation of the number of clumps.
Figure 6. Transformed binomial pvalue Φ^{1 }(1  p_{B}) (yaxis) versus log ratio between the two pvalues associated with the exceptionality of the motif in each sequence (xaxis) under model M1 for all octamers in the E. coli backbone/loops comparison.
Test for overlaps
Very few motifs have significant differences in their clumps sizes. Table 8 presents the results for the 4 motifs having a pvalue smaller than 10%. For all of them, no overlap is observed in the backbone (C_{2 }= N_{2 }means that all clumps are of size 1 while few are observed in the loops (C_{1 }<N_{1}). The probability a is the overlapping probability under model M00.
Table 8. Octamers with significant differences in terms of overlaps in the backbone/loops comparison.
Conclusion
We have proposed two complementary statistical tests to compare the exceptionalities of motif counts in two sequences. The binomial test is exact and particularly of interest for small counts (from a computational point of view). For large counts, we advise to use the likelihood ratio test which is asymptotic but strongly correlated with the exact binomial test. The LRT statistics is simple to calculate and can be directly interpreted as a contrast measure between the exceptionalities; its pvalue can be derived from the chisquare distribution. Both tests will be implemented in the R'MES software already devoted to exceptional motifs [18].
The likelihood ratio test can be generalized to more than two sequences. Suppose we want to compare I sequences S_{1}, S_{2},..., S_{I}. In each of them, we assume that the count N_{i }has a Poisson distribution with parameter λ_{i }= k_{i }ℓ_{i }μ_{i }and we want to test H_{0 }= {k_{1 }= k_{2 }= ⋯ = k_{I}} versus H_{1 }= {At least one k_{i }differs from the others}. The LRT statistics is
Under H_{0}, LRT has an asymptotic chisquare distribution with (I  1) degrees of freedom. The Chisquare test can be generalized as well.
Under the Poisson model, both tests can be easily used for degenerated motifs or more generally for sets of motifs. Let denote by a set of motifs; The count N_{i }(respectively the occurrence probability μ_{i}) will be the sum of the counts (resp. occurrence probability) of w for all motifs w from . However, the generalization is much more involved for the compound Poisson model because of the possible overlaps between motifs from the set; In particular, the overlapping probability a_{i }becomes a matrix [19].
We emphasize that these tests are valid only for independent sequences. They can not be used to detect skewed oligomers because the leading strand is not independent from the lagging strand [20]. This particular question requires the development of another rigorous statistical method; this is an ongoing work.
Finally, note that the exceptionality comparison of word counts in sequences is actually equivalent to the differential analysis of SAGE expression data [21]. Indeed, in the SAGE technology, the expression level of a given gene is measured by a number of associated tags and the problem is to compare the number of tags between two conditions. In such problem, no correction has to be done except for the total number of tags and our test statistics under model M00 are adapted.
Methods
Likelihood ratio test
The model presented in the Results Section can be rephrased as two Poisson processes with respective intensity k_{i }u_{i }(i = 1,2). To calculate the likelihood, we need to estimate the exceptionality coefficients k_{1 }and k_{2}. Under the alternative hypothesis, their respective maximum likelihood estimates (MLE) are _{1 }= N_{1}/(ℓ_{1 }μ_{1}) and _{2 }= N_{2}/(ℓ_{2 }μ_{2}). Assuming that the two sequences are independent, the loglikelihood of the two processes is
Under the null hypothesis, the common MLE of k_{1 }and k_{2 }is = (N_{1 }+ N_{2})/(ℓ_{1 }μ_{1 }+ ℓ_{2 }μ_{2}) and the loglikelihood is
The LRT is defined as twice the difference between _{1 }and _{0}: LRT = 2(_{1 } _{0}). The result follows after standard algebraic manipulations.
Appendix
Exact hypergeometric test
Conditional distribution of the number of clumps
The conditional distribution of C_{i } 1 given in (2) can be modified as
N_{i } C_{i }~ (N_{i } 1, b_{i})
where b_{i }= h_{i}a_{i }is the true overlapping probability. This version is preferable, since the exceptionality coefficient h_{i }directly appears here as a multiplicative constant. The conditional distribution of the difference N_{i } C_{i }given the clump counts C_{1 }and C_{2 }and the total count N_{+ }is a generalized negative hypergeometric distribution (see [12] p. 264 for the classical version and p. 270 for the generalization):
where A is the constant such that the sum over all n_{1 }between C_{1 }and N_{+ }is equal to one.
Test
Under H_{0 }= {h_{1 }= h_{2}}, the term b_{1}/b_{2 }can be replaced by a_{1}/a_{2}. The overlapping probability b_{1 }is significantly greater than b_{2 }if N_{1 }is significantly large, i.e. if Pr{N_{1 }≥ n_{1}C_{1}, C_{2}, N_{+}} is small. The power of this test can also be studied: under H_{0}, b_{1}/b_{2 }equals a_{1}/a_{2}, while under the alternative hypothesis, it is equal to (h_{1}/h_{2}) (a_{1}/a_{2}). The power of the test is then a function of h_{1}/h_{2}.
Authors' contributions
SR and SS developed the statistical methodology, analyzed the examples and wrote the paper. VV studied the usage conditions. All authors read and approved the final manuscript.
Acknowledgements
We thank Meriem El Karoui and MarieAgnès Petit for helpful discussions. We also thank the referees for their remarks. This work has been supported by the French Action Concertée Incitative IMPBio.
References

van Helden J, André B, ColladoVides J: Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies.
J Mol Biol 1998, 281:827842. PubMed Abstract  Publisher Full Text

El Karoui M, Biaudet V, Schbath S, Gruss A: Characteristics of Chi distribution on several bacterial genomes.
Research in Microbiology 1999, 150:579587. PubMed Abstract  Publisher Full Text

Bigot S, Saleh O, Lesterlin C, Pages C, El Karoui M, Dennis C, Grigoriev M, Allemand JF, Barre FX, Cornet F: KOPS: DNA motifs that control E. coli chromosome segregation by orienting the FtsK translocase.
EMBO J 2005, 24:37703780. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lothaire M: Applied Combinatorics on Words, Volume 105 of Encyclopedia of Mathematics and its Applications. Cambridge University Press; 2005.

Robin S, Rodolphe F, Schbath S: DNA Words and Models. Cambridge University Press; 2005.
[English version of ADN, mots et modéles, BELIN 2003].

Davidsen T, Rodland E, Lagesen K, Seeberg E, Rognes T, Tonjum T: Biased distribution of DNA uptake sequences towards genome maintenance genes.
Nucleic Acids Research 2004, 32:10501058. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Touzain F, Schbath S, DebledRennesson I, Aigle B, Leblond , Kucherov G: SIGffRid: Searching for transcription factor binding sites in bacterial genomes using comparative approach and biologically driven statistics.
2006.
[Preprint. Preliminary version in JOBIM 2005 proceedings, 417426].

Valens M, Penaud S, Rossignol M, Cornet F, Boccard F: Macrodomain organization of the Escherichia coli chromosome.
EMBO J 2004, 23:43304341. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chiapello H, Bourgait I, Sourivong F, Heuclin G, Jacquemard A, Petit MA, El Karoui M: Systematic determination of the MOSAIC structure – backbone versus strain specific loops – of bacterial genomes.
BMC Bioinformatics 2005, 6:171. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

McNeil J, Smith K, Hall L, Lawrence J: Word frequency analysis reveals enrichment of dinucleotide repeats on the human X chromosome and [GATA]n in the X escape region.
Genome Research 2006, 16:477484. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Schbath S: Compound Poisson approximation of word counts in DNA sequences.

Johnson NL, Kotz S, Kemp AW: Univariate Discrete Distributions. Wiley: NewYork; 1992.

Armitage P, Colton T, (Eds): Encyclopedia of Biostatistics. Wiley; 1998.

Vandewalle V: Etude de motifs dans les séquences d'ADN : comparaison d'exceptionnalités. In Master's thesis. Institut National Agronomique ParisGrignon; 2005.

Robin S: A compound Poisson model for words occurrences in DNA sequences.
J R Statist Soc C 2002, 51:437451. Publisher Full Text

Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerfull approach to multiple testing.

Hoebeke M, Schbath S: R'MES: Finding Exceptional Motifs. [http://genome.jouy.inra.fr/ssb/rmes/] webcite

Roquain E, Schbath S: Improved compound Poisson approximation for the number of occurrences of multiple words in a stationary Markov chain.

Salzberg S, Salzberg A, Kerlavage A, Tomb JF: Skewed Oligomers and Origins of Replication.
Gene 1998, 217:5767. PubMed Abstract  Publisher Full Text

Audic S, Claverie JM: The significance of digital gene expression profiles.
Genome Research 1997, 7:986995. PubMed Abstract  Publisher Full Text