Statistical tests to compare motif count exceptionalities

Robin, Stéphane; Schbath, Sophie; Vandewalle, Vincent

doi:10.1186/1471-2105-8-84

Methodology article
Open access
Published: 08 March 2007


BMC Bioinformatics volume 8, Article number: 84 (2007)


Abstract

Background

Finding over- or under-represented motifs in biological sequences is now a common task in genomics. Thanks to p-value calculation for motif counts, exceptional motifs are identified and represent candidate functional motifs. The present work addresses the related question of comparing the exceptionality of one motif in two different sequences. Just comparing the motif count p-values in each sequence is indeed not sufficient to decide if this motif is significantly more exceptional in one sequence compared to the other one. A statistical test is required.

Results

We develop and analyze two statistical tests, an exact binomial one and an asymptotic likelihood ratio test, to decide whether the exceptionality of a given motif is equivalent or significantly different in two sequences of interest. For that purpose, motif occurrences are modeled by Poisson processes, with special care for overlapping motifs. Both tests can take the sequence compositions into account. As an illustration, we compare the octamer exceptionalities in the Escherichia coli K-12 backbone versus variable strain-specific loops.

Conclusion

The exact binomial test is particularly well suited to small counts. For large counts, we advise using the likelihood ratio test, which is asymptotic but strongly correlated with the exact binomial test and very simple to use.

Background

Detecting motifs with a significantly unexpected frequency in DNA sequences has become a very common task in genome analysis. It is generally used to propose candidate functional motifs based on their statistical properties [1–3]. Many statistical methods have been developed for that purpose (see the recent surveys by [4] or [5] and references therein), and satisfactory solutions now exist to find exceptional motifs thanks to p-value calculations.

More recently, a new related question has arisen in the literature concerning the comparison of motif exceptionalities in two sequences. One may want, for instance, to compare particular sets of genes [6], upstream regions of CDSs versus the whole chromosome [7], structural domains [8], CDSs versus intergenic regions, conserved regions versus strain-specific regions of bacterial genomes [9], or chromosomes from the same species [10]. Chromosomes from different species can also be compared from a comparative genomics point of view. In all these works, one would like to know if a given motif is significantly more exceptional in one sequence than in another. This criterion is usually used to identify motifs which are specific to some regions or expected to be more frequent in some particular parts of the genome. Transcription factor binding sites, for instance, are expected to be more frequent in upstream regions than along the whole genome.

Surprisingly, no rigorous statistical method has been proposed yet to decide if a given motif, exact or not, is significantly more exceptional in one sequence than in a second one. Of course, two p-values can be calculated separately on each sequence to know if the motif is exceptional in these sequences, but the difficult point is how to compare these two p-values from a statistical point of view. Taking the difference or the ratio of the two p-values is indeed not sufficient to know whether they are significantly different; one needs a statistical test.

In this paper, we propose two statistical tests to compare the motif count exceptionalities in two independent sequences. In the Results Section, we first present the underlying model for motif occurrences and the null hypothesis to test, namely that the motif is similarly exceptional in both sequences. Then we derive an exact binomial test and an asymptotic likelihood ratio test adapted for frequent motifs. Usage conditions and power of both tests are described in the Discussion Section, together with a more refined model for occurrences of overlapping words and the associated tests. An illustration of the method is finally given: we compare the octamer exceptionalities in two sets of regions (backbone/loops) from the Escherichia coli K-12 leading strands. These two sets correspond to the mosaic structure of E. coli's genome when comparing the two strains K-12 and O157:H7: the backbone represents the common regions whereas the loops are specific to the K-12 strain. As a toy example throughout this paper, we will treat in detail the case of the palindromic octamer cagcgctg, which occurs 30 times in the loops (758 434 bp long) and 113 times in the backbone (3 882 513 bp long).

Results

Poisson model

In sequence i, the motif count N_i is supposed to have a Poisson distribution with mean (and variance) λ_i. This distribution has been shown to fit correctly theoretical (in Markovian sequences, for example) as well as observed count distributions of non-overlapping words [11]; a non-overlapping word is a word such that two occurrences of itself cannot overlap in a sequence.

The mean count λ_i in sequence i must account for three parameters: (i) the length ℓ_i of the sequence, (ii) the composition of the sequence, (iii) the possible exceptionality of the motif in the sequence.

Expected intensity

The composition of the sequence can be accounted for via the probability μ_i for the motif to occur at any position in the sequence under a simple model. The most popular models are Markov chain models, which can fit the frequencies of mono-, di-, tri-nucleotides, etc. Indeed, the Markov chain model of order m (denoted by Mm) takes the (m + 1)-mer composition into account. Under such models, the occurrence probability μ_i of an h-letter motif w = w₁ w₂ ... w_h on the {a, c, g, t} alphabet can be expressed in terms of counts of its subwords of length m and m + 1 [5]. For instance, here are the expressions of μ_i in models M0, M1 and M(h - 2), which fit respectively the composition in bases, in dinucleotides and in oligonucleotides of length h - 1:

\begin{array}{ll} \text{M0:} & μ_{i} = \frac{\prod_{j = 1}^{h} N_{i}(w_{j})}{ℓ_{i}^{h}}, \\ \text{M1:} & μ_{i} = \frac{\prod_{j = 1}^{h - 1} N_{i}(w_{j} w_{j + 1})}{ℓ_{i} \prod_{j = 2}^{h - 1} N_{i}(w_{j})}, \\ \text{M}(h - 2)\text{:} & μ_{i} = \frac{N_{i}(w_{1} \dots w_{h - 1}) \, N_{i}(w_{2} \dots w_{h})}{(ℓ_{i} - h + 3) \, N_{i}(w_{2} \dots w_{h - 1})}, \end{array}

where N_i(·) denotes the count in sequence i.

If one does not want to account for the sequence composition (this case will be referred to as model M00), then μ_i simply depends on the motif, hence μ₁ = μ₂ = (1/4)^h.
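As a minimal sketch of the M00, M0 and M1 expressions above, the occurrence probability can be computed directly from subword counts. The function names (`count`, `mu_m00`, `mu_m0`, `mu_m1`) are illustrative and not taken from any published software; sequences are assumed to be plain strings over {a, c, g, t}.

```python
from math import prod

def count(seq: str, word: str) -> int:
    """Number of (possibly overlapping) occurrences of `word` in `seq`."""
    return sum(seq.startswith(word, i) for i in range(len(seq) - len(word) + 1))

def mu_m00(word: str) -> float:
    """M00: no composition correction, every letter has probability 1/4."""
    return 0.25 ** len(word)

def mu_m0(seq: str, word: str) -> float:
    """M0: product of the base counts divided by l^h."""
    return prod(count(seq, base) for base in word) / len(seq) ** len(word)

def mu_m1(seq: str, word: str) -> float:
    """M1: dinucleotide counts over (l times the counts of the inner letters)."""
    h = len(word)
    numerator = prod(count(seq, word[j:j + 2]) for j in range(h - 1))
    denominator = len(seq) * prod(count(seq, word[j]) for j in range(1, h - 1))
    return numerator / denominator if denominator else 0.0
```

The expected count under a given model is then `len(seq) * mu`, which is the quantity ℓ_i μ_i used throughout the text.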

The choice of the Markov chain model depends on the sequence composition one wants to fit. For instance, model M2 is often used for coding DNA sequences to take the codon bias into account. The higher the model order, the better the fit, but the model order is usually bounded either by h - 2 or because the sequence is too short (the number of parameters to be estimated increases exponentially with the order).

Table 1 gives the expected counts ℓ_i μ_i for the motif cagcgctg in the E. coli loops/backbone sequences. Since N₁ = 30 and N₂ = 113, we see that this motif is highly over-represented in both sequences under models M00, M0 and M1. However, under the richest possible model (M6), it is over-represented in sequence 1 (loops) but under-represented in sequence 2 (backbone).

Table 1 Expected count for cagcgctg in the loops (1) and in the backbone (2) of E. coli leading strands under different models.


Exceptionality coefficient

When the motif is not exceptional with respect to the considered model, the mean count λ_i is simply ℓ_i μ_i. For exceptional motifs, i.e. motifs with an observed count N_i far from its expectation ℓ_i μ_i under a given model, the mean count λ_i should reflect this exceptionality.

We therefore introduce an exceptionality coefficient k_i which allows λ_i to be greater (or smaller) than the expected value:

λ_i := k_i ℓ_i μ_i.

In the following, parameters ℓ_i and μ_i will be supposed to be known a priori: they can be considered as two correction terms. The inference will only be made on k_i.
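To make the role of k_i concrete, the sketch below evaluates the natural estimate N_i/(ℓ_i μ_i) (the maximum likelihood estimate given in the Methods Section) for the toy motif cagcgctg under model M00, using the counts and lengths quoted in the Background. The function name is illustrative.

```python
def exceptionality_coefficient(n_obs: int, length: int, mu: float) -> float:
    """Estimate k = N / (l * mu): observed count over expected count."""
    return n_obs / (length * mu)

mu_m00 = 0.25 ** 8  # occurrence probability of any octamer under M00

k_loops = exceptionality_coefficient(30, 758_434, mu_m00)
k_backbone = exceptionality_coefficient(113, 3_882_513, mu_m00)
print(f"k_loops = {k_loops:.2f}, k_backbone = {k_backbone:.2f}")
```

Both estimates are larger than one, in line with the over-representation reported for M00; whether they differ significantly is the question addressed by the tests that follow.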

Hypothesis testing

Comparing the (potential) exceptionality of a motif in two sequences is equivalent to testing the null hypothesis H₀ = {k₁ = k₂}.

We emphasize that the respective values of k₁ and k₂ can be larger than one (unexpectedly frequent motif), smaller than one (unexpectedly rare motif) or close to one (motif with expected count). These values do not matter: our only concern is to know if they are significantly different or not.

Exact binomial test

We first propose an exact test based on a general property of the Poisson distribution. If N₁ and N₂ are two independent Poisson counts with respective means λ₁ and λ₂, the distribution of N₁ given their sum N₊ := N₁ + N₂ is binomial [12]: N₁ ~ ℬ(N₊, π) with

π = \frac{λ_{1}}{λ_{1} + λ_{2}} = \frac{(k_{1} / k_{2}) ℓ_{1} μ_{1}}{(k_{1} / k_{2}) ℓ_{1} μ_{1} + ℓ_{2} μ_{2}} .

Under H₀, we have π = π₀ with

π_{0} = \frac{ℓ_{1} μ_{1}}{ℓ_{1} μ_{1} + ℓ_{2} μ_{2}} (1)

because k₁ = k₂. In absence of correction (M00 model) for the sequence composition (i.e. μ₁ = μ₂), we have π₀ = ℓ₁/(ℓ₁ + ℓ₂). If furthermore the two sequences have the same length, we get π₀ = 1/2.

Moreover, the proportion π, and hence the expectation of N₁, increases as the ratio k₁/k₂ increases. Therefore, the p-value for the one-sided alternative H₁ = {k₁ > k₂} is p_B = Pr{ℬ(n₊, π₀) ≥ n₁}, i.e.

p_{B} = 1 - \sum_{d = 0}^{n_{1} - 1} \binom{n_{+}}{d} π_{0}^{d} (1 - π_{0})^{n_{+} - d}

where n₊ and n₁ are the observed values of N₊ and N₁.
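The exact test can be sketched with the standard library only. The helper names (`pi0`, `binom_pmf`, `binomial_pvalue`) are illustrative; the upper tail is summed directly rather than computed as one minus the lower tail, an implementation choice made here to avoid cancellation for very small p-values.

```python
from math import exp, lgamma, log

def pi0(len1: float, mu1: float, len2: float, mu2: float) -> float:
    """Null proportion pi_0 = l1*mu1 / (l1*mu1 + l2*mu2), Eq. (1)."""
    return len1 * mu1 / (len1 * mu1 + len2 * mu2)

def binom_pmf(d: int, n: int, p: float) -> float:
    """Binomial probability Pr{B(n, p) = d}, computed in log space."""
    if p <= 0.0:
        return 1.0 if d == 0 else 0.0
    if p >= 1.0:
        return 1.0 if d == n else 0.0
    log_coef = lgamma(n + 1) - lgamma(d + 1) - lgamma(n - d + 1)
    return exp(log_coef + d * log(p) + (n - d) * log(1.0 - p))

def binomial_pvalue(n1: int, n2: int, p0: float) -> float:
    """One-sided p-value p_B = Pr{B(n1 + n2, p0) >= n1} for H1: k1 > k2."""
    n_total = n1 + n2
    return sum(binom_pmf(d, n_total, p0) for d in range(n1, n_total + 1))

# Toy example under M00 (mu1 = mu2): cagcgctg in loops (1) vs backbone (2).
mu = 0.25 ** 8
p0 = pi0(758_434, mu, 3_882_513, mu)
p_b = binomial_pvalue(30, 113, p0)
print(f"pi0 = {p0:.4f}, p_B = {p_b:.4f}")
```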

Table 2 gives the probability π₀ and the p-value p_B for the motif cagcgctg in E. coli. At level 5%, the null hypothesis is not rejected under models M00 and M6: no significant difference in exceptionality is detected between the two sequences with respect to their length and/or 7-mer composition. However, {k₁ = k₂} is rejected at level 5% against {k₁ > k₂} under models M0 and M1; since cagcgctg is over-represented in both sequences, this means that it is significantly more over-represented in sequence 1 (loops) with respect to the base and/or dinucleotide compositions of both sequences.

Table 2 Probability π₀ and p-value p_B under different models for cagcgctg in the E. coli loops/backbone comparison.


Likelihood ratio test (LRT)

Another test statistic based on the comparison of the likelihood of the data under the H₀ and the alternative hypothesis H₁ = {k₁ ≠ k₂} can be derived. This statistic is known as the Likelihood Ratio Test (see [13], vol. IV). In our model (see the Methods Section), it is defined as

\mathrm{LRT} = 2 \left[ N_{1} \ln \left( \frac{N_{1} / N_{+}}{π_{0}} \right) + N_{2} \ln \left( \frac{N_{2} / N_{+}}{1 - π_{0}} \right) \right]

where π₀ is defined in (1). Under the null hypothesis, its asymptotic distribution is a chi-square distribution with one degree of freedom.

This test is two-sided because, under H₁, parameters k₁ and k₂ are estimated independently (in particular, without the constraint k₁ > k₂). The exact distribution of LRT could be calculated via permutation techniques, but the computation time would be tremendous for large counts. We will therefore calculate the following asymptotic p-value:

p_{L} = \Pr \left\{ χ^{2} \geq 2 \left[ n_{1} \ln \left( \frac{n_{1} / n_{+}}{π_{0}} \right) + n_{2} \ln \left( \frac{n_{2} / n_{+}}{1 - π_{0}} \right) \right] \right\},

where n₂ is the observed value of N₂ and χ² ~ χ² (1).
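A short sketch of the LRT statistic and its asymptotic p-value follows. It relies on the identity Pr{χ²(1) ≥ x} = erfc(√(x/2)), which holds for one degree of freedom, so no external library is needed. Function names are illustrative, and a zero count is handled with the usual convention 0 · ln 0 = 0.

```python
from math import erfc, log, sqrt

def lrt_statistic(n1: int, n2: int, p0: float) -> float:
    """LRT = 2 [n1 ln((n1/n+)/p0) + n2 ln((n2/n+)/(1 - p0))]."""
    n_total = n1 + n2
    stat = 0.0
    if n1 > 0:
        stat += n1 * log((n1 / n_total) / p0)
    if n2 > 0:
        stat += n2 * log((n2 / n_total) / (1.0 - p0))
    return 2.0 * stat

def chi2_1df_sf(x: float) -> float:
    """Survival function of the chi-square distribution with 1 df."""
    return erfc(sqrt(max(x, 0.0) / 2.0))

# Toy example under M00: cagcgctg, 30 occurrences in loops, 113 in backbone.
p0 = 758_434 / (758_434 + 3_882_513)
lrt = lrt_statistic(30, 113, p0)
p_l = chi2_1df_sf(lrt)
print(f"LRT = {lrt:.3f}, p_L = {p_l:.4f}")
```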

Table 3 gives the LRT statistic and the associated p-value for the motif cagcgctg in E. coli. Remember that the LRT is two-sided, so p_L has to be divided by two when compared to the one-sided binomial p-value p_B (provided that N₁ exceeds its expectation π₀ N₊). We see that the significances obtained with the LRT differ from the ones obtained with the exact binomial test, but the qualitative conclusions are the same.

Table 3 LRT statistic and associated p-value p_L under different models for cagcgctg in the E. coli loops/backbone comparison.


Chi-square test

Another standard asymptotic test is the chi-square test, where the counts N_i are compared to their expected values $\hat{N}_i$ under H₀ given the total count N₊:

X^{2} = \sum_{i = 1}^{2} \frac{(N_{i} - \hat{N}_{i})^{2}}{\hat{N}_{i}}

where $\hat{N}_1$ = π₀ N₊ and $\hat{N}_2$ = (1 - π₀) N₊. Under the null hypothesis, X² also has an asymptotic chi-square distribution with one degree of freedom. It is also an intrinsically two-sided test. Further analyses (including simulations) not presented here (see [14]) show that this test performs very similarly to the LRT in every situation. Note that the chi-square test is the same as the score test [13].
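For comparison with the LRT, the chi-square statistic takes only a few lines. This is an illustrative sketch; the function name is not from the paper.

```python
def pearson_statistic(n1: int, n2: int, p0: float) -> float:
    """X^2 = sum_i (N_i - Nhat_i)^2 / Nhat_i with Nhat_1 = p0 * N+."""
    n_total = n1 + n2
    expected1 = p0 * n_total
    expected2 = (1.0 - p0) * n_total
    return (n1 - expected1) ** 2 / expected1 + (n2 - expected2) ** 2 / expected2

# Toy example under M00 (pi_0 is the length ratio).
p0 = 758_434 / (758_434 + 3_882_513)
x2 = pearson_statistic(30, 113, p0)
print(f"X^2 = {x2:.3f}")
```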

Discussion

LRT distribution

The chi-square distribution of the LRT statistic is only asymptotic, so the actual level may be different from the nominal one (typically α = 5%). To measure this difference, we have calculated this actual level for different values of π₀ and N₊. Since LRT is a function of N₁, the actual level can be derived from the exact distribution of N₁ given N₊ which is binomial (see Results Section).

Figure 1 compares both levels (actual and nominal). Since the counts are discrete, the actual level can never be exactly α, leading to oscillations in the plot. We see that the nominal level is only reached with N₊ ≃ 1000 for π₀ = 0.5 and even later for π₀ = 0.95 (or π₀ = 0.05). This means that the chi-square approximation of the LRT statistic is only valid for motifs with many total occurrences.

Regarding the motif cagcgctg, because π₀ is about 15% (cf. Table 2), the picture is close to the right plot of Figure 1. In fact, with a total count of 143, the actual level is respectively 0.095%, 1.1%, 5.1% and 12.5% for a nominal level α equal to 0.1%, 1%, 5% and 10%.
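The actual level described above can be sketched as follows: for each possible value of N₁ given N₊, the asymptotic LRT p-value is computed, and the exact binomial probabilities of the values that lead to rejection are summed. The function name and the choice π₀ = 0.15 in the usage line are illustrative assumptions, so the printed value need not coincide with the figures quoted in the text.

```python
from math import erfc, exp, lgamma, log, sqrt

def actual_level(n_total: int, p0: float, alpha: float) -> float:
    """Exact probability, under H0, that the asymptotic LRT rejects at level alpha.

    Assumes 0 < p0 < 1.
    """
    level = 0.0
    for n1 in range(n_total + 1):
        n2 = n_total - n1
        # LRT statistic, with the convention 0 * ln(0) = 0.
        lrt = 0.0
        if n1 > 0:
            lrt += 2.0 * n1 * log((n1 / n_total) / p0)
        if n2 > 0:
            lrt += 2.0 * n2 * log((n2 / n_total) / (1.0 - p0))
        # Asymptotic chi-square(1) p-value of the statistic.
        p_asymptotic = erfc(sqrt(max(lrt, 0.0) / 2.0))
        if p_asymptotic <= alpha:
            # Exact binomial probability of this outcome under H0.
            log_pmf = (lgamma(n_total + 1) - lgamma(n1 + 1) - lgamma(n2 + 1)
                       + n1 * log(p0) + n2 * log(1.0 - p0))
            level += exp(log_pmf)
    return level

print(f"actual level at nominal 5%: {actual_level(143, 0.15, 0.05):.4f}")
```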

LRT as a contrast measure

The LRT statistic can still be used as a contrast measure, i.e. a measure of the difference, between the two exceptionalities. For large values of N₊ it is much faster and easier to compute than the binomial p-value. We will see in the illustration below that the two quantities are strongly correlated.

Decidability limits for the binomial test

Because the binomial test is exact, the actual and nominal levels are equal, and the significance can always be determined. The significance is maximal when N₁ = N₊ (i.e. N₂ = 0), in which case the p-value p_B equals π₀^{N₊}. Therefore, if this minimal p-value is greater than the desired level α (typically 5%), no significant result can be obtained. This happens when π₀^{N₊} > α, i.e. when N₊ < ln(α)/ln(π₀) (the inequality is reversed because ln(π₀) < 0).

Figure 2 gives this critical value of N₊ for various values of π₀ and α. We see, for instance, that for π₀ = 0.7 and N₊ = 10, one may get significant results at a level greater than 5% but not at a level smaller than 1%.
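The decidability limit can be checked numerically. Since the smallest attainable p-value is π₀^{N₊}, a level α can only be reached when π₀^{N₊} ≤ α, that is when N₊ is at least ln(α)/ln(π₀). The function names below are illustrative.

```python
from math import ceil, log

def min_attainable_pvalue(n_total: int, p0: float) -> float:
    """Smallest possible p_B, obtained when all occurrences fall in sequence 1."""
    return p0 ** n_total

def min_total_count(p0: float, alpha: float) -> int:
    """Smallest total count N+ for which significance at level alpha is attainable."""
    return ceil(log(alpha) / log(p0))

# Example from the text: pi_0 = 0.7 and N+ = 10.
print(min_attainable_pvalue(10, 0.7))   # between 1% and 5%
print(min_total_count(0.7, 0.05))       # total count needed at the 5% level
print(min_total_count(0.7, 0.01))       # total count needed at the 1% level
```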

Power

An important property for a statistical test is its ability to detect departure from the null hypothesis. This ability is measured by the power of the test which is the probability to exceed the significance threshold (defined under H₀) when the true parameter satisfies H₁. In our case, the parameter of interest is

π = \frac{λ_{1}}{λ_{1} + λ_{2}} = \frac{(k_{1} / k_{2}) ℓ_{1} μ_{1}}{(k_{1} / k_{2}) ℓ_{1} μ_{1} + ℓ_{2} μ_{2}}

which is equal to π₀ when k₁ = k₂. So the departure from H₀ will be measured by the ratio k₁/k₂ when it differs from 1.

Exact binomial

Figure 3 presents the power of the exact binomial test when k₁/k₂ increases. As expected, the power increases with N₊. Moreover, it decreases when π₀ increases i.e. when the expected ratio ℓ₁ μ₁/(ℓ₂ μ₂) increases. It means that, when the motif is already expected to be much more frequent in sequence 1 than in sequence 2, it is more difficult to detect that its exceptionality in the first sequence is also higher.

The motif cagcgctg occurs N₊ = 143 times in the whole genome. In the different models considered in Table 2, probability π₀ is between 11.6% and 16.4%. The power of the binomial test in this case can therefore be read in Figure 3, in the two top plots between the black and red solid lines. We see that a ratio k₁/k₂ = 2 can be detected with probability greater than 90%, while a ratio of 1.5 will be detected with a bit more than 50% probability.
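A power calculation of this kind can be sketched by first finding the critical count under π₀ and then evaluating the rejection probability under the alternative proportion π = r π₀/(r π₀ + 1 - π₀), where r = k₁/k₂ (this follows from the expression of π above). The function names and the value π₀ = 0.15 in the usage lines are illustrative assumptions.

```python
from math import exp, lgamma, log

def upper_tail(c: int, n: int, p: float) -> float:
    """Pr{B(n, p) >= c} for 0 < p < 1, summed in log space."""
    total = 0.0
    for d in range(max(c, 0), n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(d + 1) - lgamma(n - d + 1)
                   + d * log(p) + (n - d) * log(1.0 - p))
        total += exp(log_pmf)
    return total

def critical_count(n: int, p0: float, alpha: float) -> int:
    """Smallest c such that Pr{B(n, p0) >= c} <= alpha."""
    for c in range(n + 2):          # c = n + 1 gives an empty tail, so the loop always returns
        if upper_tail(c, n, p0) <= alpha:
            return c
    return n + 1

def power(n: int, p0: float, ratio: float, alpha: float = 0.05) -> float:
    """Probability of rejecting H0 when k1/k2 = ratio, given N+ = n."""
    p_alt = ratio * p0 / (ratio * p0 + 1.0 - p0)
    return upper_tail(critical_count(n, p0, alpha), n, p_alt)

for ratio in (1.0, 1.5, 2.0):
    print(ratio, round(power(143, 0.15, ratio), 3))
```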

LRT

The same analysis can be made for the LRT. However, this only makes sense for sufficiently large N₊, to guarantee the validity of the chi-square distribution.

Case of overlapping words

Compound Poisson model

The distribution of overlapping word occurrences is better modeled by a compound Poisson process (see [15]) in the following way:

The word occurs in clumps distributed according to a Poisson process. The number of clumps C_i in sequence i is hence a Poisson random variable with mean denoted by $\tilde{λ}_i$.
The size V_ic of the c-th clump (in sequence i) is random with geometric distribution:

Pr{V_ic = v} = a_i^{v - 1} (1 - a_i).

The clump sizes are supposed to be independent. Parameter a_i is the overlapping probability of the motif and can be calculated under various Markovian models (see [5]).

In this setting, the count N_i is hence the sum of the sizes of the C_i clumps and has the Pólya-Aeppli (or geometric Poisson) distribution (see [12]). We have (see [5]) $\tilde{λ}_i$ = (1 - a_i) λ_i. In the case of a non-overlapping word, we have C_i = N_i, a_i = 0 and $\tilde{λ}_i$ = λ_i. For overlapping words, the mean clump size is equal to 1/(1 - a_i) and increases with a_i.
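The relation between the clump intensity and the mean count can be checked in two lines, assuming (as is usual for a compound Poisson model) that the clump sizes are independent of the number of clumps:

```latex
\begin{aligned}
\mathbb{E}[V_{ic}] &= \sum_{v \geq 1} v \, a_i^{\,v-1} (1 - a_i) = \frac{1}{1 - a_i}, \\
\mathbb{E}[N_i] &= \mathbb{E}[C_i] \, \mathbb{E}[V_{ic}] = \frac{\tilde{\lambda}_i}{1 - a_i} = \lambda_i .
\end{aligned}
```

The mean count is therefore unchanged by the clumping; only the way occurrences are grouped differs from the plain Poisson model.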

Tests

An overlapping word can occur with an exceptional frequency (i) because of an exceptional number of clumps or (ii) because of exceptional clump sizes. Comparing the exceptionalities of an overlapping word in two sequences then leads to comparing the numbers of clumps C₁ and C₂, and/or the sizes V_1c and V_2c.

Comparison of the number of clumps

In this compound Poisson model, the number of clumps in each sequence is Poisson distributed. The comparison of the counts C₁ and C₂ is then exactly equivalent to the comparison of the counts N₁ and N₂ studied in the Results Section, replacing λ_i by $\tilde{λ}_i$ and μ_i by $\tilde{μ}_i$ := (1 - a_i) μ_i.
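The quantities N_i and C_i can be obtained from a sequence as in the sketch below, which treats a clump as a maximal run of occurrences in which each occurrence overlaps the previous one. This reading of "clump" and the function names are assumptions made for illustration.

```python
def occurrence_positions(seq: str, word: str) -> list[int]:
    """Start positions of all (possibly overlapping) occurrences of `word`."""
    return [i for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i)]

def count_occurrences_and_clumps(seq: str, word: str) -> tuple[int, int]:
    """Return (N, C): number of occurrences and number of clumps."""
    positions = occurrence_positions(seq, word)
    clumps = 0
    previous = None
    for pos in positions:
        # A new clump starts when this occurrence does not overlap the previous one.
        if previous is None or pos - previous >= len(word):
            clumps += 1
        previous = pos
    return len(positions), clumps

print(count_occurrences_and_clumps("aaaa", "aa"))    # three occurrences in one clump
print(count_occurrences_and_clumps("aacaa", "aa"))   # two isolated occurrences
```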

Exact test for the overlapping probability under M00

The question is now to test the null hypothesis H₀ = {a₁ = a₂}. This comparison is made conditionally on the observed counts N₁ and N₂. It only makes sense if the motif occurs at least once in each sequence, i.e. if N₁, N₂, C₁ and C₂ are all larger than (or equal to) 1. In this case, the first occurrence necessarily corresponds to the first clump, and the C_i - 1 last clumps have to be chosen among the other N_i - 1 motif occurrences. Since a motif occurrence (except the first one) corresponds to a clump occurrence with probability 1 - a_i, the number of clumps (except the first one) has a binomial distribution:

C_i - 1 ~ ℬ(N_i - 1, 1 - a_i) (2)

which means that the expected number of clumps decreases when the overlapping probability increases.

Following the same strategy as for the non-overlapping case, we base our test on the distribution of C₁ given the total clump count C₊ = C₁ + C₂. Under H₀, (C₁ - 1) has a hypergeometric distribution ℋ(N₊ - 2, N₁ - 1, C₊ - 2) (see [12], Eq. (3.23)):

\Pr \{ C_{1} = c_{1} \mid N_{1}, N_{2}, C_{+} \} = \frac{\binom{N_{1} - 1}{c_{1} - 1} \binom{N_{2} - 1}{C_{+} - c_{1} - 1}}{\binom{N_{+} - 2}{C_{+} - 2}}

The overlapping probability a₁ is then significantly greater than a₂ if the probability Pr{C₁ ≤ c₁|N₁, N₂, C₊} is smaller than a given level α.
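The hypergeometric p-value Pr{C₁ ≤ c₁ | N₁, N₂, C₊} can be computed exactly with integer binomial coefficients. The sketch below uses illustrative function names and made-up counts in the usage line (10 occurrences in 6 clumps versus 8 isolated occurrences); they are not taken from Table 8.

```python
from math import comb

def clump_pmf(c1: int, n1: int, n2: int, c_total: int) -> float:
    """Pr{C1 = c1 | N1 = n1, N2 = n2, C+ = c_total} under H0: a1 = a2."""
    c2 = c_total - c1
    if not (1 <= c1 <= n1 and 1 <= c2 <= n2):
        return 0.0
    return comb(n1 - 1, c1 - 1) * comb(n2 - 1, c2 - 1) / comb(n1 + n2 - 2, c_total - 2)

def overlap_pvalue(c1: int, n1: int, c2: int, n2: int) -> float:
    """One-sided p-value Pr{C1 <= c1 | N1, N2, C+} for H1: a1 > a2."""
    c_total = c1 + c2
    return sum(clump_pmf(c, n1, n2, c_total) for c in range(1, c1 + 1))

# Hypothetical counts: overlaps in sequence 1 only.
print(overlap_pvalue(c1=6, n1=10, c2=8, n2=8))
```

A small p-value indicates fewer clumps than expected in sequence 1 given its number of occurrences, i.e. more overlaps than in sequence 2.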

Exact test in the general case

The previous test does not account for the composition of the sequences. The overlapping probabilities a₁ and a₂ can be expected to be different, according to some null model. In this case, the true overlapping probability in sequence i is b_i = h_i a_i, where h_i is an exceptionality coefficient (analogous to k_i for the mean count). The problem is then to test H₀ = {h₁ = h₂}. Such a test is proposed in the Appendix: it involves the generalized negative hypergeometric distribution.

Asymptotic tests

As for the counts N and C, asymptotic tests such as likelihood ratio, chi-square or score tests can be derived to compare exceptionalities in terms of overlaps. These tests are not presented here to avoid further statistical developments but also because the small overlapping probabilities generally observed make them rarely relevant.

Illustration

Materials

Comparing complete genomes of strains of a single bacterial species allows one to determine highly conserved regions (the so-called backbone) and numerous strain-specific DNA segments (so-called loops) for each strain. These mosaic structures help to understand the evolution of bacterial genomes. Indeed, the backbone probably corresponds to the common ancestral strain and is under vertical pressure, whereas the loops may be associated with mobile elements or strain-specific pathogenicity. Such a backbone/loops segmentation has been systematically performed [9] and stored in the public MOSAIC database [16]. We have extracted from this database the E. coli K-12 specific loops (sequence 1) and the backbone (sequence 2) obtained from the pairwise alignment of the complete genomes of the E. coli K-12 laboratory strain and the enterohemorrhagic E. coli O157:H7 strain. As an illustration, we have compared the exceptionalities of all the 65 536 octamers in the backbone versus in the loops. Such a comparison will point out octamers which do not have the same constraint, with respect to their frequency, on the loops versus on the backbone.

Exact binomial test

Figure 4 presents the significance of the binomial test for all octamers in the backbone/loops comparison. The limits between the different significance levels are clear under M00 because the probability π₀ is the same for all octamers, while they are fuzzy under M1 because π₀ depends on the octamer composition. In this case, the same counts (N₁, N₂) may result in different p_B values. The distribution of the p-value p_B is summarized in Table 4. The 10 motifs with the smallest p-values, i.e. with an exceptionality coefficient significantly higher in the loops than in the backbone, are listed at the top of Table 5. Multiple testing problems arise when we compare the exceptionalities of the 65 536 octamers simultaneously. Table 6 gives the number of significant octamers and the corresponding threshold when adjusting for a False Discovery Rate (FDR, [17]) of 1%. For example, under model M1 only 154 octamers are significantly more exceptional in the loops. These octamers all have a p-value p_B smaller than 2.2 × 10⁻⁵.

Table 4 Number of significantly unbalanced octamers under different models and for different thresholds.


Table 5 Top: 10 motifs with smallest p-value p_B (k_loops > k_backbone) for models M00, M0, M1 and M6. * indicates overlapping words. Bottom: 10 motifs with smallest p-value p′_B (k_backbone > k_loops).


Table 6 Top: numbers of octamers significantly more exceptional in the loops when adjusting for a False Discovery Rate of 1% and associated thresholds for the p-value p_B for different models. Bottom: idem for octamers significantly more exceptional in the backbone.


Symmetrically, to find the motifs with an exceptionality coefficient significantly higher in the backbone than in the loops, we have to test H₀ versus H′₁ = {k₂ > k₁} using the p-value p′_B defined as p′_B = Pr{ℬ(n₊, π₀) ≤ n₁}. The 10 most significant motifs for this test are given at the bottom of Table 5. When adjusting for a False Discovery Rate of 1%, only 14 octamers under model M1 are significantly more exceptional in the backbone than in the loops. These octamers all have a p-value p′_B smaller than 1.8 × 10⁻⁶. Note that under model M6, no octamer is significant after multiple testing adjustment.

According to the p_B list, the motif cagcgctg has rank 1 115 among 65 536 under the M1 model. Note that the well-known Chi motif (gctggtgg), which is the most over-represented octamer in the E. coli genome, has a p′_B value of 5.1 × 10⁻⁵ (rank 1 281) under the same model. This means that k_backbone is significantly higher than k_loops, but due to multiple testing Chi is not among the significant octamers.

LRT versus binomial

We now compare the results provided by the two tests: binomial and LRT. Because the former is one-sided and the latter is two-sided, we use a signed version LRT^s of the LRT statistic, which is equal to LRT when N₁ is greater than expected (N₁ ≥ π₀ N₊) and to −LRT otherwise (N₁ < π₀ N₊). To make the graph more readable, we also transform the p-value p_B into a Gaussian score S_B ∈ ℝ:

S_B = Φ⁻¹(1 - p_B)

where Φ is the cumulative distribution function of the standard Gaussian distribution. High positive values of S_B correspond to motifs with an exceptionality coefficient in sequence 1 significantly higher than in sequence 2, while high negative values of S_B correspond to motifs having an exceptionality coefficient in sequence 1 significantly lower than in sequence 2.
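Both transformed quantities are easy to compute with the standard library. The sketch uses the symmetry Φ⁻¹(1 - p) = −Φ⁻¹(p), an implementation choice made here so that very small p-values are not lost when 1 - p rounds to 1 in floating point. Function names are illustrative, and p_B is assumed to lie strictly between 0 and 1.

```python
from statistics import NormalDist

def gaussian_score(p_b: float) -> float:
    """S_B = Phi^{-1}(1 - p_B), computed as -Phi^{-1}(p_B); requires 0 < p_B < 1."""
    return -NormalDist().inv_cdf(p_b)

def signed_lrt(lrt: float, n1: int, n2: int, p0: float) -> float:
    """LRT^s: +LRT when N1 >= pi_0 * N+, -LRT otherwise."""
    return lrt if n1 >= p0 * (n1 + n2) else -lrt

print(gaussian_score(0.025))            # close to the familiar 1.96
print(signed_lrt(2.1, 30, 113, 0.1634))
```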

We see in Figure 5 that the two statistics give very similar results for all the octamers in the backbone/loops comparison. Table 7 gives the Spearman and Kendall correlation coefficients between the two statistics for different models. Recall that Spearman's coefficient is the correlation between the ranks, while Kendall's coefficient is based on the proportion of concordant pairs between the two rankings. This confirms that the LRT statistic is a reliable exceptionality comparison score, although the associated p-value is questionable for small counts.

Table 7 Spearman and Kendall correlation coefficients between LRT^s and S_B for different models.


Note that the naive comparison between the two p-values simply associated with the exceptionality of each motif in each sequence does not provide the same sets of significant octamers (see Figure 6). Such p-values have been calculated using the Poisson approximation of the number of clumps.

Test for overlaps

Very few motifs have significant differences in their clump sizes. Table 8 presents the results for the 4 motifs having a p-value smaller than 10%. For all of them, no overlap is observed in the backbone (C₂ = N₂ means that all clumps are of size 1), while a few are observed in the loops (C₁ < N₁). The probability a is the overlapping probability under model M00.

Table 8 Octamers with significant differences in terms of overlaps in the backbone/loops comparison.


Conclusion

We have proposed two complementary statistical tests to compare the exceptionalities of motif counts in two sequences. The binomial test is exact and particularly of interest for small counts (from a computational point of view). For large counts, we advise using the likelihood ratio test, which is asymptotic but strongly correlated with the exact binomial test. The LRT statistic is simple to calculate and can be directly interpreted as a contrast measure between the exceptionalities; its p-value can be derived from the chi-square distribution. Both tests will be implemented in the R'MES software already devoted to exceptional motifs [18].

The likelihood ratio test can be generalized to more than two sequences. Suppose we want to compare I sequences S₁, S₂, ..., S_I. In each of them, we assume that the count N_i has a Poisson distribution with parameter λ_i = k_i ℓ_i μ_i, and we want to test H₀ = {k₁ = k₂ = ⋯ = k_I} versus H₁ = {at least one k_i differs from the others}. The LRT statistic is

\mathrm{LRT} = 2 \sum_{i} N_{i} \ln \left( \frac{N_{i}}{N_{+}} \frac{\sum_{j} μ_{j} ℓ_{j}}{μ_{i} ℓ_{i}} \right) .

Under H₀, LRT has an asymptotic chi-square distribution with (I - 1) degrees of freedom. The Chi-square test can be generalized as well.
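The generalized statistic can be sketched as below, taking the observed counts N_i and the expected counts ℓ_i μ_i as inputs. The function name is illustrative; zero counts contribute nothing to the sum (convention 0 · ln 0 = 0). Only the statistic is computed here; its p-value would come from a chi-square distribution with I - 1 degrees of freedom.

```python
from math import log

def lrt_multi(counts: list[int], expected: list[float]) -> float:
    """LRT = 2 * sum_i N_i ln( (N_i / N+) * (sum_j l_j mu_j) / (l_i mu_i) )."""
    n_total = sum(counts)
    e_total = sum(expected)
    return 2.0 * sum(
        n * log((n / n_total) * (e_total / e))
        for n, e in zip(counts, expected)
        if n > 0
    )

# With two sequences and equal mu (model M00), only the lengths matter and the
# two-sequence LRT of the toy example is recovered.
print(lrt_multi([30, 113], [758_434, 3_882_513]))
```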

Under the Poisson model, both tests can easily be used for degenerate motifs or, more generally, for sets of motifs. Let $W$ denote a set of motifs; the count N_i (respectively the occurrence probability μ_i) will be the sum of the counts (resp. occurrence probabilities) of w over all motifs w from $W$. However, the generalization is much more involved for the compound Poisson model because of the possible overlaps between motifs from the set; in particular, the overlapping probability a_i becomes a matrix [19].

We emphasize that these tests are valid only for independent sequences. They cannot be used to detect skewed oligomers because the leading strand is not independent from the lagging strand [20]. This particular question requires the development of another rigorous statistical method; this is ongoing work.

Finally, note that the exceptionality comparison of word counts in sequences is actually equivalent to the differential analysis of SAGE expression data [21]. Indeed, in the SAGE technology, the expression level of a given gene is measured by a number of associated tags, and the problem is to compare the number of tags between two conditions. In such a problem, no correction has to be made except for the total number of tags, and our test statistics under model M00 are adapted.

Methods

Likelihood ratio test

The model presented in the Results Section can be rephrased as two Poisson processes with respective intensities k_i μ_i (i = 1, 2). To calculate the likelihood, we need to estimate the exceptionality coefficients k₁ and k₂. Under the alternative hypothesis, their respective maximum likelihood estimates (MLE) are $\hat{k}_1$ = N₁/(ℓ₁ μ₁) and $\hat{k}_2$ = N₂/(ℓ₂ μ₂). Assuming that the two sequences are independent, the log-likelihood of the two processes is

\begin{aligned} ℒ_{1} &= \sum_{i = 1}^{2} [N_{i} \ln (\hat{k}_{i} μ_{i}) - \hat{k}_{i} μ_{i} ℓ_{i}] \\ &= \sum_{i = 1}^{2} [N_{i} \ln (N_{i} / ℓ_{i}) - N_{i}] . \end{aligned}

Under the null hypothesis, the common MLE of k₁ and k₂ is $\hat{k}$ = (N₁ + N₂)/(ℓ₁ μ₁ + ℓ₂ μ₂) and the log-likelihood is

ℒ_{0} = \sum_{i = 1}^{2} [N_{i} \ln (\hat{k} μ_{i}) - \hat{k} μ_{i} ℓ_{i}] = \sum_{i = 1}^{2} [N_{i} \ln (\hat{k} μ_{i}) - N_{i}] .

The LRT is defined as twice the difference between ℒ₁ and ℒ₀: LRT = 2(ℒ₁ - ℒ₀). The result follows after standard algebraic manipulations.
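For completeness, the algebra can be written out as follows, using only the definitions of the common estimate of k under H₀ and of π₀ in (1); the -N_i terms of the two log-likelihoods cancel:

```latex
\begin{aligned}
\mathcal{L}_1 - \mathcal{L}_0
  &= \sum_{i=1}^{2} N_i \left[ \ln\frac{N_i}{\ell_i} - \ln(\hat{k}\,\mu_i) \right]
   = \sum_{i=1}^{2} N_i \ln\!\left( \frac{N_i}{N_+}\,
     \frac{\ell_1\mu_1 + \ell_2\mu_2}{\ell_i\mu_i} \right) \\
  &= N_1 \ln\!\left( \frac{N_1/N_+}{\pi_0} \right)
   + N_2 \ln\!\left( \frac{N_2/N_+}{1-\pi_0} \right),
\end{aligned}
```

since ℓ₁μ₁/(ℓ₁μ₁ + ℓ₂μ₂) = π₀ and ℓ₂μ₂/(ℓ₁μ₁ + ℓ₂μ₂) = 1 - π₀. Multiplying by two gives the LRT statistic of the Results Section.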

Appendix

Exact hyper-geometric test

Conditional distribution of the number of clumps

The conditional distribution of C_i - 1 given in (2) can be modified as

N_i - C_i ~ ℬ(N_i - 1, b_i)

where b_i = h_i a_i is the true overlapping probability. This version is preferable, since the exceptionality coefficient h_i directly appears here as a multiplicative constant. The conditional distribution of the difference N_i - C_i given the clump counts C₁ and C₂ and the total count N₊ is a generalized negative hypergeometric distribution (see [12] p. 264 for the classical version and p. 270 for the generalization):

\Pr \{ N_{1} = n_{1} \mid C_{1}, C_{2}, N_{+} \} = A^{-1} \frac{\binom{n_{1} - 1}{C_{1} - 1} \binom{N_{+} - n_{1} - 1}{C_{2} - 1}}{\binom{N_{+} - 2}{C_{+} - 2}} \left( \frac{b_{1}}{b_{2}} \right)^{n_{1} - C_{1}}

where A is the constant such that the sum over all n₁ between C₁ and N₊ is equal to one.

Test

Under H₀ = {h₁ = h₂}, the term b₁/b₂ can be replaced by a₁/a₂. The overlapping probability b₁ is significantly greater than b₂ if N₁ is significantly large, i.e. if Pr{N₁ ≥ n₁|C₁, C₂, N₊} is small. The power of this test can also be studied: under H₀, b₁/b₂ equals a₁/a₂, while under the alternative hypothesis, it is equal to (h₁/h₂) (a₁/a₂). The power of the test is then a function of h₁/h₂.
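A sketch of this test is given below. The constant denominator of the distribution is absorbed into the normalizing constant A, and the support is restricted to C₁ ≤ n₁ ≤ N₊ - C₂, since the second binomial coefficient vanishes beyond that range. The function name and the counts in the usage line are illustrative; `ratio` stands for a₁/a₂ under H₀ and is assumed to be positive.

```python
from math import comb

def overlap_pvalue_general(n1: int, c1: int, c2: int, n_total: int, ratio: float) -> float:
    """Pr{N1 >= n1 | C1, C2, N+} under H0, with b1/b2 replaced by ratio = a1/a2.

    Assumes c1 >= 1, c2 >= 1 and n_total >= c1 + c2.
    """
    def weight(k: int) -> float:
        # Unnormalized probability of N1 = k.
        return comb(k - 1, c1 - 1) * comb(n_total - k - 1, c2 - 1) * ratio ** (k - c1)

    support = range(c1, n_total - c2 + 1)
    normalizer = sum(weight(k) for k in support)
    return sum(weight(k) for k in support if k >= n1) / normalizer

# Hypothetical counts: 6 clumps for 10 occurrences vs 8 clumps for 8 occurrences.
print(overlap_pvalue_general(n1=10, c1=6, c2=8, n_total=18, ratio=1.0))
```

With `ratio = 1` the two sequences are expected to have the same overlapping probability, which corresponds to the M00 setting.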

References

1. van Helden J, André B, Collado-Vides J: Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 1998, 281: 827–842. 10.1006/jmbi.1998.1947
2. El Karoui M, Biaudet V, Schbath S, Gruss A: Characteristics of Chi distribution on several bacterial genomes. Research in Microbiology 1999, 150: 579–587. 10.1016/S0923-2508(99)00132-1
3. Bigot S, Saleh O, Lesterlin C, Pages C, El Karoui M, Dennis C, Grigoriev M, Allemand JF, Barre FX, Cornet F: KOPS: DNA motifs that control E. coli chromosome segregation by orienting the FtsK translocase. EMBO J 2005, 24: 3770–3780. 10.1038/sj.emboj.7600835
4. Lothaire M: Applied Combinatorics on Words, Volume 105 of Encyclopedia of Mathematics and its Applications. Cambridge University Press; 2005.
5. Robin S, Rodolphe F, Schbath S: DNA Words and Models. Cambridge University Press; 2005. [English version of ADN, mots et modèles, BELIN 2003].
6. Davidsen T, Rodland E, Lagesen K, Seeberg E, Rognes T, Tonjum T: Biased distribution of DNA uptake sequences towards genome maintenance genes. Nucleic Acids Research 2004, 32: 1050–1058. 10.1093/nar/gkh255
7. Touzain F, Schbath S, Debled-Rennesson I, Aigle B, Leblond , Kucherov G: SIGffRid: Searching for transcription factor binding sites in bacterial genomes using comparative approach and biologically driven statistics. 2006. [Preprint. Preliminary version in JOBIM 2005 proceedings, 417-426].
8. Valens M, Penaud S, Rossignol M, Cornet F, Boccard F: Macrodomain organization of the Escherichia coli chromosome. EMBO J 2004, 23: 4330–4341. 10.1038/sj.emboj.7600434
9. Chiapello H, Bourgait I, Sourivong F, Heuclin G, Jacquemard A, Petit MA, El Karoui M: Systematic determination of the MOSAIC structure – backbone versus strain specific loops – of bacterial genomes. BMC Bioinformatics 2005, 6: 171. 10.1186/1471-2105-6-171
10. McNeil J, Smith K, Hall L, Lawrence J: Word frequency analysis reveals enrichment of dinucleotide repeats on the human X chromosome and [GATA]n in the X escape region. Genome Research 2006, 16: 477–484. 10.1101/gr.4627606
11. Schbath S: Compound Poisson approximation of word counts in DNA sequences. ESAIM: Probability and Statistics 1995, 1: 1–16.
12. Johnson NL, Kotz S, Kemp AW: Univariate Discrete Distributions. Wiley: New-York; 1992.
13. Armitage P, Colton T, (Eds): Encyclopedia of Biostatistics. Wiley; 1998.
14. Vandewalle V: Etude de motifs dans les séquences d'ADN : comparaison d'exceptionnalités. In Master's thesis. Institut National Agronomique Paris-Grignon; 2005.
15. Robin S: A compound Poisson model for words occurrences in DNA sequences. J R Statist Soc C 2002, 51: 437–451. 10.1111/1467-9876.00279
16. MOSAIC [http://genome.jouy.inra.fr/mosaic/]
17. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS B 1995, 57: 289–300.
18. Hoebeke M, Schbath S: R'MES: Finding Exceptional Motifs. User guide, version 3 2006. [http://genome.jouy.inra.fr/ssb/rmes/]
19. Roquain E, Schbath S: Improved compound Poisson approximation for the number of occurrences of multiple words in a stationary Markov chain. Adv Appl Prob 2007., 39:
20. Salzberg S, Salzberg A, Kerlavage A, Tomb JF: Skewed Oligomers and Origins of Replication. Gene 1998, 217: 57–67. 10.1016/S0378-1119(98)00374-6
21. Audic S, Claverie JM: The significance of digital gene expression profiles. Genome Research 1997, 7: 986–995.

Acknowledgements

We thank Meriem El Karoui and Marie-Agnès Petit for helpful discussions. We also thank the referees for their remarks. This work has been supported by the French Action Concertée Incitative IMPBio.

Author information

Authors and Affiliations

INA PG/ENGREF/INRA, UMR518 Unité Mathématiques et Informatique Appliquées, 75005, Paris, France
Stéphane Robin & Vincent Vandewalle
INRA, UR1077 Unité Mathématique, Informatique et Génome, 78350, Jouy-en-Josas, France
Sophie Schbath

Corresponding authors

Correspondence to Stéphane Robin or Sophie Schbath.

Additional information

Authors' contributions

SR and SS developed the statistical methodology, analyzed the examples and wrote the paper. VV studied the usage conditions. All authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Robin, S., Schbath, S. & Vandewalle, V. Statistical tests to compare motif count exceptionalities. BMC Bioinformatics 8, 84 (2007). https://doi.org/10.1186/1471-2105-8-84

Received: 15 September 2006
Accepted: 08 March 2007
Published: 08 March 2007
DOI: https://doi.org/10.1186/1471-2105-8-84
