Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

Open Access Methodology article

Statistical tests to compare motif count exceptionalities

Stéphane Robin1*, Sophie Schbath2* and Vincent Vandewalle1

Author Affiliations

1 INA PG/ENGREF/INRA, UMR518 Unité Mathématiques et Informatique Appliquées, 75005 Paris, France

2 INRA, UR1077 Unité Mathématique, Informatique et Génome, 78350 Jouy-en-Josas, France

For all author emails, please log on.

BMC Bioinformatics 2007, 8:84  doi:10.1186/1471-2105-8-84


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/8/84


Received:15 September 2006
Accepted:8 March 2007
Published:8 March 2007

© 2007 Robin et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Finding over- or under-represented motifs in biological sequences is now a common task in genomics. Thanks to p-value calculation for motif counts, exceptional motifs are identified and represent candidate functional motifs. The present work addresses the related question of comparing the exceptionality of one motif in two different sequences. Just comparing the motif count p-values in each sequence is indeed not sufficient to decide if this motif is significantly more exceptional in one sequence compared to the other one. A statistical test is required.

Results

We develop and analyze two statistical tests, an exact binomial one and an asymptotic likelihood ratio test, to decide whether the exceptionality of a given motif is equivalent or significantly different in two sequences of interest. For that purpose, motif occurrences are modeled by Poisson processes, with a special care for overlapping motifs. Both tests can take the sequence compositions into account. As an illustration, we compare the octamer exceptionalities in the Escherichia coli K-12 backbone versus variable strain-specific loops.

Conclusion

The exact binomial test is particularly adapted for small counts. For large counts, we advise to use the likelihood ratio test which is asymptotic but strongly correlated with the exact binomial test and very simple to use.

Background

Detecting motifs with a significantly unexpected frequency in DNA sequences has become a very common task in genome analysis. It is generally addressed to propose candidate functional motifs based on their statistical properties [1-3]. Lots of statistical methods have been developed to that purpose (see the recent surveys by [4] or [5] and references therein) and satisfactory solutions exist now to find exceptional motifs thanks to p-value calculations.

More recently, a new related question has arisen in the literature concerning the comparison of motif exceptionalities in two sequences. One wants for instance to compare particular sets of genes [6], upstream regions of CDSs versus whole chromosome [7], structural domains [8], CDSs versus intergenic regions, conserved regions versus strain-specific regions of bacterial genomes [9], or chromosomes from the same species [10]. Chromosomes from different species can also be compared from a comparative genomics point of view. In all these works, one would like to know if a given motif is significantly more exceptional in one sequence compared to another one. This criterion is usually used to identify motifs which are specific from some regions or expected to be more frequent in some particular parts of the genome. Transcription factor binding sites, for instance, are expected to be more frequent in upstream regions than along the whole genome.

Surprisingly, no rigorous statistical method has been proposed yet to decide if a given motif, exact or not, is significantly more exceptional in one sequence compared to a second one. Of course, two p-values can be calculated separately on each sequence to know if the motif is exceptional in these sequences but the difficult point is how to compare these two p-values from a statistical point of view. It is indeed not sufficient to make the difference or the ratio to know if the two p-values are significantly different; One needs a statistical test.

In this paper, we propose two statistical tests to compare the motif count exceptionalities in two independent sequences. In the Results Section, we first present the underlying model for motif occurrences and the null hypothesis to test, namely the motif is similarly exceptional in both sequences. Then we derive an exact binomial test and an asymptotic likelihood ratio test adapted for frequent motifs. Usage conditions and power of both tests are described in the Discussion Section, together with a more refined model for occurrences of overlapping words and the associated tests. An illustration of the method is finally given; We compare the octamer exceptionalities in two sets of regions (backbone/loops) from the Escherichia coli K12 leading strands. These two sets correspond to the mosaic structure of E. coli's genome when comparing the two strains K12 and O157:H7: the backbone represents the common regions whereas the loops are specific to the K12 strain. As a toy example all along this paper, we will treat in detail the case of the palindromic octamer cagcgctg which occurs respectively 30 times in the loops (758434 bps long) and 113 times in the backbone (3 882 513 bps long).

Results

Poisson model

In sequence i, the motif count Ni is supposed to have a Poisson distribution with mean (and variance) λi. This distribution has been shown to fit correctly theoretical (in Markovian sequences, for example) as well as observed count distributions of non-overlapping words [11]; A non-overlapping word is a word such that two occurrences of itself can not overlap in a sequence.

The mean count λi in sequence i must account for three parameters: (i) the length ℓi of the sequence, (ii) the composition of the sequence, (iii) the possible exceptionality of the motif in the sequence.

Expected intensity

The composition of the sequence can be accounted for via the probability μi for the motif to occur at any position in the sequence under a simple model. The most popular models are Markov chain models which can fit the frequencies in mono-, di-, tri-nucleotides, etc. Indeed, the Markov chain model of order m (denoted by Mm) takes the (m + 1)-mer composition into account. Under such models, the occurrence probability μi of a h-letter motif w = w1 w2 ... wh on the {a, c, g, t} alphabet can be expressed in terms of counts of its subwords of length m and m + 1 [5]. For instance, here are the expression of μi in models M0, M1 and M(h - 2) which fit respectively the composition in bases, in dinucleotides and in oligonucleotides of length h - 1:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M1">View MathML</a>

where Ni (·) denotes the count in sequence i.

If one does not want to account for the sequence composition (this case will be referred to as model M00), then μi simply depends on the motif, hence μ1 = μ2 = (1/4)h.

The choice of the Markov chain model depends on the sequence composition one wants to fit. For instance, model M2 is often used for coding DNA sequences to take the codon bias into account. Higher the model order, better the fit, but usually the model order is bounded either by h - 2 or because the sequence is too small (the number of parameters to be estimated increases exponentially with the order).

Table 1 gives the expected counts ℓi μi for the motif cagcgctg in the E. coli loops/backbone sequences. Since N1 = 30 and N2 = 113, we see that this motif is highly over-represented in both sequences under models M00, M0 and M1. However, under the richest possible model (M6), it is over-represented in sequence 1 (loops) but under-represented in sequence 2 (backbone).

Table 1. Expected count for cagcgctg in the loops (1) and in the backbone (2) of E. coli leading strands under different models.

Exceptionality coefficient

When the motif is not exceptional with respect to the considered model, the mean count λi is simply ℓi μi. For exceptional motifs, i.e. motifs with an observed count Ni far from its expectation ℓi μi, under a given model, the mean count λi should reflect this exceptionality.

We therefore introduce an exceptionality coefficient ki which allows λi to be greater (or smaller) than the expected value:

λi : = ki i μi.

In the following, parameters ℓi and μi will be supposed to be known a priori: they can be considered as two correction terms. The inference will only be made on ki.

Hypothesis testing

Comparing the (potential) exceptionality of a motif in two sequences is equivalent to test the null hypothesis H0 = {k1 = k2}.

We emphasize that the respective values of k1 and k2 can be larger than one (unexpectedly frequent motif), smaller than one (unexpectedly rare motif) or close to one (motif with expected count). These values do not matter: our only concern is to know if they are significantly different or not.

Exact binomial test

We first propose an exact test based on a general property of the Poisson distribution. If N1 and N2 are two independent Poisson counts with respective means λ1 and λ2, the distribution of N1 given their sum N+ : = N1 + N2 is binomial [12]: N1 ~ <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M2">View MathML</a> (N+, π) with

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M3">View MathML</a>

Under H0, we have π = π0 with

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M4">View MathML</a>

because k1 = k2. In absence of correction (M00 model) for the sequence composition (i.e. μ1 = μ2), we have π0 = ℓ1/(ℓ1 + ℓ2). If furthermore the two sequences have the same length, we get π0 = 1/2.

Moreover, the proportion π and then the expectation of N1, increases as the ratio k1/k2 increases. Therefore, the p-value for the one-sided alternative H1 = {k1 > k2} is pB = Pr {<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M2">View MathML</a> (n+, π0) ≥ n1}, i.e.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M5">View MathML</a>

where n+ and n1 are the observed values of N+ and N1.

Table 2 gives the probability π0 and the p-value pB for the motif cagcgctg in E. coli. At level 5%, the null hypothesis is accepted under models M00 and M6 meaning that the motif is similarly exceptional in both sequences with respect to their length and/or 7-mer composition. However, {k1 = k2} is rejected at level 5% against {k1 > k2} under models M0 and M1; since cagcgctg is over-represented in both sequences, it means that it is significantly more exceptionally over-represented in sequence 1 (loops) with respect to the base and/or dinucleotide compositions of both sequences.

Table 2. Probability π0 and p-value pB under different models for cagcgctg in the E. coli loops/backbone comparison.

Likelihood ratio test (LRT)

Another test statistic based on the comparison of the likelihood of the data under the H0 and the alternative hypothesis H1 = {k1 k2} can be derived. This statistic is known as the Likelihood Ratio Test (see [13], vol. IV). In our model (see the Methods Section), it is defined as

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M6">View MathML</a>

where π0 is defined in (1). Under the null hypothesis, its asymptotic distribution is a chi-square distribution with one degree of freedom.

This test is two-sided, because, under H1, parameters k1 and k2 are estimated independently (in particular, without the constraint k1 > k2). The exact distribution of LRT could be calculated via permutation techniques but the computation time would be tremendeous for large counts. We will then calculate the following asymptotic p-value:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M7">View MathML</a>

where n2 is the observed value of N2 and χ2 ~ χ2 (1).

Table 3 gives the LRT statistic and the associated p-value for the motif cagcgctg in E. coli. Remember that the LRT is two-sided, so pL have to be divided by two when compared to the one-sided binomial p-value pB. We see that the significances obtained with the LRT are different from the ones obtained with the exact binomial test, but the qualitative conclusions are the same.

Table 3. LRT statistic and associated p-value pL under different models for cagcgctg in the E. coli loops/backbone comparison.

Chi-square test

Another standard asymptotic test is the chi-square test where the counts Ni are compared to their expected values <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M8">View MathML</a>i under H0 given the total count N+:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M9">View MathML</a>

where <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M8">View MathML</a>1 = π0 N+ and <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M8">View MathML</a>2 = (1 - π0)N+. Under the null hypothesis, X2 has also an asymptotic chi-square distribution with one degree of freedom. It is also an intrinsically two-sided test. Further analyzes (including simulations) not presented here (see [14]) show that this test performs very similarly to the LRT in every situations. Note that the chi-square test is the same as the score test [13].

Discussion

LRT distribution

The chi-square distribution of the LRT statistic is only asymptotic, so the actual level may be different from the nominal one (typically α = 5%). To measure this difference, we have calculated this actual level for different values of π0 and N+. Since LRT is a function of N1, the actual level can be derived from the exact distribution of N1 given N+ which is binomial (see Results Section).

Figure 1 compares both levels (actual and nominal). Since the counts are discrete, the actual level can never be exactly α leading to oscillations in the plot. We see that the nominal level is only reached with N+ ≃ 1000 for π0 = 0.5 and even later for π0 = 0.95 (or π0 = 0.05). It means that the chi-square approximation of the LRT statistics is only valid for motifs with many total occurrences.

thumbnailFigure 1. Actual level (log scale) of the LRT as a function of N+ (log scale) for a nominal level α = 0.1%, 1%, 5% or 10% and probability π0 = 0.5 (left) and 0.95 (right). (Since the LRT test is two-sided, the right plot also holds for π0 = 0.05).

Regarding the motif cagcgctg, because π0 is about 15% (cf. Table 2), the picture is close to the right plot of Figure 1; In fact, with a total count of 143, the actual level is respectively 0.095%, 1.1%, 5.1% and 12.5% for a nominal level α equal to 0.1%, 1%, 5% and 10%.

LRT as a contrast measure

The LRT statistic can still be used as a contrast measure, i.e. a measure of the difference, between the two exceptionalities. For large values of N+ it is much faster and easier to compute than the binomial p-value. We will see in the illustration below that the two quantities are strongly correlated.

Decidability limits for the binomial test

Because the binomial test is exact, the actual and nominal levels are equal. The significance can then always be determined. It would be maximal when N1 = N+ (i.e. N2 = 0) and the corresponding p-value pB would be equal to <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M10">View MathML</a>. Therefore, if this minimal p-value is greater than the desired level α (typically 5%), no significance conclusion can be made. This happens when <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M10">View MathML</a>α, i.e. when N+ ≥ ln (α)/ln(π0).

Figure 2 gives this critical value of N+ for various values of π0 and α. We see, for instance, that for π0 = 0.7 and N+ = 10, one may get significant results at a level greater than 5% but not at a level smaller than 1%.

thumbnailFigure 2. Minimal count N+ to get a significant result with the binomial test as a function of the probability π0. Curves correspond to different levels α = 0.1%, 1%, 5% or 10% (from top to bottom).

Power

An important property for a statistical test is its ability to detect departure from the null hypothesis. This ability is measured by the power of the test which is the probability to exceed the significance threshold (defined under H0) when the true parameter satisfies H1. In our case, the parameter of interest is

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M11">View MathML</a>

which is equal to π0 when k1 = k2. So the departure from H0 will be measured by the ratio k1/k2 when it differs from 1.

Exact binomial

Figure 3 presents the power of the exact binomial test when k1/k2 increases. As expected, the power increases with N+. Moreover, it decreases when π0 increases i.e. when the expected ratio ℓ1 μ1/(ℓ2 μ2) increases. It means that, when the motif is already expected to be much more frequent in sequence 1 than in sequence 2, it is more difficult to detect that its exceptionality in the first sequence is also higher.

thumbnailFigure 3. Power of the exact binomial test with level α = 5% as a function of k1/k2 (x-axis) for different values of π0. Curves correspond to different values of the total count N+ = 5 (dashed black), 10 (dashed red), 20 (dashed blue), 50 (dashed green), 100 (solid black), 500 (solid red) and 1000 (solid blue). Missing curves correspond to the values of N+ for which no significant results at level α = 5% can be obtained (cf. the Discussion Section).

The motif cagcgctg occurs N+ = 143 times in the whole genome. In the different models considered in Table 2, probability π0 is between 11.6% and 16.4%. The power of the binomial test in this case can therefore be read in Figure 3, in the two top plots between the black and red solid lines. We see that a ratio k1/k2 = 2 can be detected with probability greater than 90%, while a ratio of 1.5 will be detected with a bit more than 50% probability.

LRT

The same analysis can be made for the LRT tests. However, this only makes sense for sufficiently large N+, to guaranty the validity of the chi-square distribution.

Case of overlapping words

Compound Poisson model

The distribution of overlapping word occurrences is better modeled by a compound Poisson process (see [15]) in the following way:

• The word occurs in clumps distributed according to a Poisson process. The number of clumps Ci in sequence i is hence a random Poisson variable with mean denoted by <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M12">View MathML</a>i.

• The size Vic of the c-th clump (in sequence i) is random with geometric distribution:

Pr{Vic = v} = <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M13">View MathML</a> (1 - ai).

The clump sizes are supposed to be independent. Parameter ai is the overlapping probability of the motif and can be calculated under various Markovian models (see [5]).

In this setting, the count Ni is hence the sum of the sizes of Ci clumps and has the Polya-Aeppli (or geometric Poisson) distribution (see [12]). We have (see [5]) <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M12">View MathML</a>i = (1 - ai) λi. In the case of a non-overlapping word, we have Ci = Ni, ai = 0 and λi = λi. For overlapping words, the mean clump size is equal to 1/(1 - ai) and increases with ai.

Tests

An overlapping word can occur with an exceptional frequency (i) because of an exceptional number of clumps or (ii) because of exceptional sizes of clumps. Then comparing the exceptionalities of an overlapping word in two sequences leads to compare the number of clumps C1 with C2, and/or the sizes V1c's with V2c's.

Comparison of the number of clumps

In this compound Poisson model, the number of clumps in each sequence is Poisson distributed. The comparison of the counts C1 and C2 is then exactly equivalent to the comparison of the counts N1 and N2 studied in the Results Section, replacing λi by <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M12">View MathML</a>i and μi by <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M14">View MathML</a>i := (1 - ai) μi.

Exact test for the overlapping probability under M00

The question is now to test the null hypothesis H0 = {a1 = a2}. This comparison is made conditionally to the observed counts N1 and N2. It only makes sense if the motif occurs at least once in each sequence, i.e. if N1, N2, C1 and C2 are all larger than (or equal to) 1. In this case, the first occurrence necessarily corresponds to the first clump and the Ci - 1 last clumps have to be chosen among the other Ni - 1 motif occurrences. Since a motif occurrence (except the first one) corresponds to a clump occurrence with probability 1 - ai, the number of clumps (except the first one) has a binomial distribution:

Ci - 1 ~ <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M2">View MathML</a> (Ni - 1, 1 - ai)     (2)

which means that the expected number of clumps decreases when the overlapping probability increases.

Following the same strategy as for the non-overlapping case, we base our test on the distribution of C1 given the total clump count C+ = C1 + C2. Under H0, (C1 - 1) has an hyper-geometric distribution <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M15">View MathML</a> (N+ - 2, N1 - 1, C+ - 2) (see [12], Eq. (3.23)):

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M16">View MathML</a>

The overlapping probability a1 is then significantly greater than a2 if the probability Pr{C1 c1|N1, N2, C+} is smaller than a given level α.

Exact test in the general case

The previous test does not account for the composition of the sequences. The overlapping probabilities a1 and a2 can be expected to be different, according to some null model. In this case, the true overlapping probability in sequence i is bi = hi ai, where hi is an exceptionality coefficient (analogous to ki for the mean count). The problem is then to test H0 = {h1 = h2}. Such a test is proposed in Appendix: it involves the generalized negative hyper-geometric distribution.

Asymptotic tests

As for the counts N and C, asymptotic tests such as likelihood ratio, chi-square or score tests can be derived to compare exceptionalities in terms of overlaps. These tests are not presented here to avoid further statistical developments but also because the small overlapping probabilities generally observed make them rarely relevant.

Illustration

Materials

Comparing complete genomes of strains of single bacterial species allows to determine highly conserved regions (so-called backbone) and numerous strain-specific DNA segments (so-called loops) for each strain. These mosaic structures help to understand the evolution of bacterial genomes. Indeed, the backbone probably corresponds to the common ancestral strain and is under vertical pressure whereas the loops may be associated with mobile elements or strain-specific pathogenicity. Such backbone/loops segmentation has been systematically performed [9] and store in the public MOSAIC database [16]. We have extracted from this database the E. coli K-12 specific loops (sequence 1) and the backbone (sequence 2) obtained from the pairwise alignment of the complete genomes of E. coli K-12 laboratory strain and the enterohemorrhagic E. coli O157:H7 strain. As an illustration, we have compared the exceptionalities of all the 65536 octamers in the backbone versus in the loops. Such comparison will point out octamers which do not have the same constraint, with respect to their frequency, on the loops versus on the backbone.

Exact binomial test

Figure 4 presents the significance of the binomial test for all octamers in the backbone/loops comparison. The limits between the different significance levels are clear under M00 because the probability π0 is the same for all octamers, while they are fuzzy under M1 because π0 depends on the octamer composition. In this case, same counts (N1, N2) may result in different pB values. The distribution of the p-value pB is summarized in Table 4. The 10 motifs with smallest p-values, i.e. with an exceptionality coefficient significantly higher in the loops than in the backbone, are listed in the top of Table 5. Multiple testing problems arise when we compare the exceptionalities of the 65 536 octamers simultaneously. Table 6 gives the number of significant octamers and the corresponding threshold when adjusting for a False Discovery Rate (FDR, [17]) of 1%. For example, under model M1 only 154 octamers are significantly more exceptional in the loops. These octamers have all a p-value pB smaller than 2.2 10-5.

thumbnailFigure 4. Counts N2 (x-axis: backbone) and N1 (y-axis: loops) for all the octamers under the M00 (left) and M1 (right) models. The color indicates the significance of the binomial test in the M00 model: blue = 'pB > 0.01', green = 'pB < 0.01', yellow = 'pB < 0.001', red = 'pB < 0.0001'. The solid black line on the left indicates the expected ratio N1/N2 = π0/(1 - π0).

Table 4. Number of significantly unbalanced octamers under different models and for different thresholds.

Table 5. Top: 10 motifs with smallest p-value pB (kloops > kbackbone) for model M00, M0, M1 and M6. * indicates overlapping words. Bottom: 10 motifs with smallest p-value

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M17">View MathML</a>

(kbackbone > kloops).

Table 6. Top: numbers of octamers significantly more exceptional in the loops when adjusting for a False Discovery Rate of 1% and associated thresholds for the p-value pB for different models. Bottom: idem for octamers significantly more exceptional in the backbone.

Symmetrically, to find the motifs with an exceptionality coefficient significantly higher in the backbone than in the loops, we have to test H0 versus <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M18">View MathML</a> = {k2 > k1} using the p-value <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M17">View MathML</a> defined as <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M17">View MathML</a> = Pr{<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M2">View MathML</a> (n+, π0) ≤ n1}. The 10 most significant motifs for this test are given at the bottom of Table 5. When adjusting for a False Discovery Rate of 1%, only 14 octamers under model M1 are significantly more exceptional in the backbone than in the loops. These octamers have all a p-value <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M17">View MathML</a> smaller than 1.8 10-6. Note that under model M6, no octamer is significant after multiple testing adjustment.

According to the pB list, the motif cagcgctg has rank 1 115 among 65 536 under the M1 model. Note that the well known Chi motif (gctggtgg) which is the most overrepresented octamer in E. coli genome has a <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M17">View MathML</a> value of 5.1 10-5 (rank 1 281) under the same model; It means that kbackbone is significantly higher than kloops but due to multiple testing Chi is not among the significant octamers.

LRT versus binomial

We now compare the results provided by the two tests: binomial and LRT. Because the former is one-sided and the latter is two-sided, we use a signed version LRTs of the LRT statistic which is equal to LRT when N1 is greater than expected (N1 π0 N+) and to – LRT otherwise (N1 <π0 N+). To make the graph more readable, we also transform the p-value pB into a Gaussian score SB ∈ ℝ:

SB = Φ-1 (1 - pB)

where Φ is the cumulative distribution function of the standard Gaussian distribution. High positive values of SB correspond to motifs with an exceptionality coefficient in sequence 1 significantly higher than in sequence 2, while high negative values of SB correspond to motifs having an exceptionality coefficient in sequence 1 significantly lower than in sequence 2.

We see in Figure 5 that the two statistics give very similar results for all the octamers in the backbone/loops comparison. Table 7 gives the Spearman and Kendall correlation coefficients between the two statistics for different models. Recall that Spearman's coefficient is the correlation between the ranks, while Kendall's one is the proportion of concordant pairs between the two rankings. This confirms that the LRT statistics is a reliable exceptionality comparison score, although the associated p-value is questionable for small counts.

thumbnailFigure 5. Signed LRT statistic LRTs (y-axis) versus transformed binomial p-value Φ-1 (1 - pB) (x-axis) under model M1 for all octamers in the E. coli backbone/loops comparison.

Table 7. Spearman and Kendall correlation coefficients between LRTs and SB for different models.

Note that the naive comparison between the two p-values simply associated with the exceptionality of each motif in each sequence does not provide the same sets of significant octamers (see Figure 6). Such p-values have been calculated using the Poisson approximation of the number of clumps.

thumbnailFigure 6. Transformed binomial p-value Φ-1 (1 - pB) (y-axis) versus log ratio between the two p-values associated with the exceptionality of the motif in each sequence (x-axis) under model M1 for all octamers in the E. coli backbone/loops comparison.

Test for overlaps

Very few motifs have significant differences in their clumps sizes. Table 8 presents the results for the 4 motifs having a p-value smaller than 10%. For all of them, no overlap is observed in the backbone (C2 = N2 means that all clumps are of size 1 while few are observed in the loops (C1 <N1). The probability a is the overlapping probability under model M00.

Table 8. Octamers with significant differences in terms of overlaps in the backbone/loops comparison.

Conclusion

We have proposed two complementary statistical tests to compare the exceptionalities of motif counts in two sequences. The binomial test is exact and particularly of interest for small counts (from a computational point of view). For large counts, we advise to use the likelihood ratio test which is asymptotic but strongly correlated with the exact binomial test. The LRT statistics is simple to calculate and can be directly interpreted as a contrast measure between the exceptionalities; its p-value can be derived from the chi-square distribution. Both tests will be implemented in the R'MES software already devoted to exceptional motifs [18].

The likelihood ratio test can be generalized to more than two sequences. Suppose we want to compare I sequences S1, S2,..., SI. In each of them, we assume that the count Ni has a Poisson distribution with parameter λi = ki i μi and we want to test H0 = {k1 = k2 = ⋯ = kI} versus H1 = {At least one ki differs from the others}. The LRT statistics is

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M19">View MathML</a>

Under H0, LRT has an asymptotic chi-square distribution with (I - 1) degrees of freedom. The Chi-square test can be generalized as well.

Under the Poisson model, both tests can be easily used for degenerated motifs or more generally for sets of motifs. Let denote by <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M20">View MathML</a> a set of motifs; The count Ni (respectively the occurrence probability μi) will be the sum of the counts (resp. occurrence probability) of w for all motifs w from <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M20">View MathML</a>. However, the generalization is much more involved for the compound Poisson model because of the possible overlaps between motifs from the set; In particular, the overlapping probability ai becomes a matrix [19].

We emphasize that these tests are valid only for independent sequences. They can not be used to detect skewed oligomers because the leading strand is not independent from the lagging strand [20]. This particular question requires the development of another rigorous statistical method; this is an ongoing work.

Finally, note that the exceptionality comparison of word counts in sequences is actually equivalent to the differential analysis of SAGE expression data [21]. Indeed, in the SAGE technology, the expression level of a given gene is measured by a number of associated tags and the problem is to compare the number of tags between two conditions. In such problem, no correction has to be done except for the total number of tags and our test statistics under model M00 are adapted.

Methods

Likelihood ratio test

The model presented in the Results Section can be rephrased as two Poisson processes with respective intensity ki ui (i = 1,2). To calculate the likelihood, we need to estimate the exceptionality coefficients k1 and k2. Under the alternative hypothesis, their respective maximum likelihood estimates (MLE) are <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M21">View MathML</a>1 = N1/(ℓ1 μ1) and <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M21">View MathML</a>2 = N2/(ℓ2 μ2). Assuming that the two sequences are independent, the log-likelihood of the two processes is

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M22">View MathML</a>

Under the null hypothesis, the common MLE of k1 and k2 is <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M21">View MathML</a> = (N1 + N2)/(ℓ1 μ1 + ℓ2 μ2) and the log-likelihood is

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M23">View MathML</a>

The LRT is defined as twice the difference between <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M24">View MathML</a>1 and <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M24">View MathML</a>0: LRT = 2(<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M24">View MathML</a>1 - <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M24">View MathML</a>0). The result follows after standard algebraic manipulations.

Appendix

Exact hyper-geometric test

Conditional distribution of the number of clumps

The conditional distribution of Ci - 1 given in (2) can be modified as

Ni - Ci ~ <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M2">View MathML</a> (Ni - 1, bi)

where bi = hiai is the true overlapping probability. This version is preferable, since the exceptionality coefficient hi directly appears here as a multiplicative constant. The conditional distribution of the difference Ni - Ci given the clump counts C1 and C2 and the total count N+ is a generalized negative hyper-geometric distribution (see [12] p. 264 for the classical version and p. 270 for the generalization):

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/84/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/84/mathml/M25">View MathML</a>

where A is the constant such that the sum over all n1 between C1 and N+ is equal to one.

Test

Under H0 = {h1 = h2}, the term b1/b2 can be replaced by a1/a2. The overlapping probability b1 is significantly greater than b2 if N1 is significantly large, i.e. if Pr{N1 n1|C1, C2, N+} is small. The power of this test can also be studied: under H0, b1/b2 equals a1/a2, while under the alternative hypothesis, it is equal to (h1/h2) (a1/a2). The power of the test is then a function of h1/h2.

Authors' contributions

SR and SS developed the statistical methodology, analyzed the examples and wrote the paper. VV studied the usage conditions. All authors read and approved the final manuscript.

Acknowledgements

We thank Meriem El Karoui and Marie-Agnès Petit for helpful discussions. We also thank the referees for their remarks. This work has been supported by the French Action Concertée Incitative IMPBio.

References

  1. van Helden J, André B, Collado-Vides J: Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies.

    J Mol Biol 1998, 281:827-842. PubMed Abstract | Publisher Full Text OpenURL

  2. El Karoui M, Biaudet V, Schbath S, Gruss A: Characteristics of Chi distribution on several bacterial genomes.

    Research in Microbiology 1999, 150:579-587. PubMed Abstract | Publisher Full Text OpenURL

  3. Bigot S, Saleh O, Lesterlin C, Pages C, El Karoui M, Dennis C, Grigoriev M, Allemand JF, Barre FX, Cornet F: KOPS: DNA motifs that control E. coli chromosome segregation by orienting the FtsK translocase.

    EMBO J 2005, 24:3770-3780. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  4. Lothaire M: Applied Combinatorics on Words, Volume 105 of Encyclopedia of Mathematics and its Applications. Cambridge University Press; 2005. OpenURL

  5. Robin S, Rodolphe F, Schbath S: DNA Words and Models. Cambridge University Press; 2005.

    [English version of ADN, mots et modéles, BELIN 2003].

    OpenURL

  6. Davidsen T, Rodland E, Lagesen K, Seeberg E, Rognes T, Tonjum T: Biased distribution of DNA uptake sequences towards genome maintenance genes.

    Nucleic Acids Research 2004, 32:1050-1058. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  7. Touzain F, Schbath S, Debled-Rennesson I, Aigle B, Leblond , Kucherov G: SIGffRid: Searching for transcription factor binding sites in bacterial genomes using comparative approach and biologically driven statistics.

    2006.

    [Preprint. Preliminary version in JOBIM 2005 proceedings, 417-426].

  8. Valens M, Penaud S, Rossignol M, Cornet F, Boccard F: Macrodomain organization of the Escherichia coli chromosome.

    EMBO J 2004, 23:4330-4341. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  9. Chiapello H, Bourgait I, Sourivong F, Heuclin G, Jacquemard A, Petit MA, El Karoui M: Systematic determination of the MOSAIC structure – backbone versus strain specific loops – of bacterial genomes.

    BMC Bioinformatics 2005, 6:171. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  10. McNeil J, Smith K, Hall L, Lawrence J: Word frequency analysis reveals enrichment of dinucleotide repeats on the human X chromosome and [GATA]n in the X escape region.

    Genome Research 2006, 16:477-484. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  11. Schbath S: Compound Poisson approximation of word counts in DNA sequences.

    ESAIM: Probability and Statistics 1995, 1:1-16. OpenURL

  12. Johnson NL, Kotz S, Kemp AW: Univariate Discrete Distributions. Wiley: New-York; 1992. OpenURL

  13. Armitage P, Colton T, (Eds): Encyclopedia of Biostatistics. Wiley; 1998. OpenURL

  14. Vandewalle V: Etude de motifs dans les séquences d'ADN : comparaison d'exceptionnalités. In Master's thesis. Institut National Agronomique Paris-Grignon; 2005. OpenURL

  15. Robin S: A compound Poisson model for words occurrences in DNA sequences.

    J R Statist Soc C 2002, 51:437-451. Publisher Full Text OpenURL

  16. MOSAIC[http://genome.jouy.inra.fr/mosaic/] webcite

  17. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerfull approach to multiple testing.

    JRSS B 1995, 57:289-300. OpenURL

  18. Hoebeke M, Schbath S: R'MES: Finding Exceptional Motifs. [http://genome.jouy.inra.fr/ssb/rmes/] webcite

    User guide, version 3 2006. OpenURL

  19. Roquain E, Schbath S: Improved compound Poisson approximation for the number of occurrences of multiple words in a stationary Markov chain.

    Adv Appl Prob 2007., 39 OpenURL

  20. Salzberg S, Salzberg A, Kerlavage A, Tomb JF: Skewed Oligomers and Origins of Replication.

    Gene 1998, 217:57-67. PubMed Abstract | Publisher Full Text OpenURL

  21. Audic S, Claverie JM: The significance of digital gene expression profiles.

    Genome Research 1997, 7:986-995. PubMed Abstract | Publisher Full Text OpenURL