  • Methodology article
  • Open access

Compo: composite motif discovery using discrete models

Abstract

Background

Computational discovery of motifs in biomolecular sequences is an established field, with applications both in the discovery of functional sites in proteins and regulatory sites in DNA. In recent years there has been increased attention towards the discovery of composite motifs, typically occurring in cis-regulatory regions of genes.

Results

This paper describes Compo: a discrete approach to composite motif discovery that supports richer modeling of composite motifs and a more realistic background model compared to previous methods. Furthermore, multiple parameter and threshold settings are tested automatically, and the most interesting motifs across settings are selected. This avoids reliance on single hard thresholds, which has been a weakness of previous discrete methods. Comparison of motifs across parameter settings is made possible by the use of p-values as a general significance measure. Compo can either return an ordered list of motifs, ranked according to the general significance measure, or a Pareto front corresponding to a multi-objective evaluation on sensitivity, specificity and spatial clustering.

Conclusion

Compo performs very competitively compared to several existing methods on a collection of benchmark data sets. These benchmarks include a recently published, large benchmark suite where the use of support across sequences allows Compo to correctly identify binding sites even when the relevant PWMs are mixed with a large number of noise PWMs. Furthermore, the possibility of parameter-free running offers high usability, the support for multi-objective evaluation allows a rich view of potential regulators, and the discrete model allows flexibility in modeling and interpretation of motifs.

Background

Computational discovery of motifs corresponding to functional sites in proteins or binding sites in DNA is an established field within bioinformatics. In particular, the discovery of transcription factor binding sites in DNA has received much attention. Experimental identification of binding sites is a tedious process. Given the ever increasing number of genomes that are sequenced, computational identification of regulatory elements is needed to speed up the annotation process.

A typical approach for motif discovery is to use regulatory (promoter) regions for genes that are believed to be co-regulated as input, and try to predict individual DNA binding sites and possibly associated transcription factors that can explain the co-regulation. Typical software tools are MEME [1] and AlignACE [2]. This has turned out to be a very challenging problem. In particular the large number of false positive binding sites predicted by most methods represents a problem [3]. One promising improvement to this strategy is to search for combinations of binding sites, rather than individual occurrences.

Gene regulation usually has a combinatorial complexity [4], i.e. a combination of transcription factors (TFs) is often needed for active regulation. These TFs may be co-acting either directly through physical contact or indirectly through additional factors. As co-acting TFs may be expected to be in physical proximity, their binding sites are often clustered in sequence space. However, this is not a strict requirement as the DNA strand may form loops between distant sites [5]. Also, a given regulatory region may contain several possibly independent subsets of TF binding sites, representing alternative regulatory contexts. Clusters of binding sites involved in co-regulation are often referred to as cis-regulatory modules (CRMs), composite motifs or structured motifs [6], and they usually contain binding sites for a few TFs [7]. In this paper we refer to the model of a binding site of an individual TF as a single motif, and a given set of single motifs as a composite motif. We also use the term module when we want to emphasize the biological aspects of the TF combination.

Several computational methods have been developed for the discovery of composite motifs [8]. One line of methods, often called de novo module discovery, tries to find composite motifs using only DNA sequences as input data (e.g. CisModule [9], LOGOS [10], EMCMODULE [11]). This is a notoriously difficult problem, in many cases with close to random performance. However, biologists will often have some prior knowledge about potential regulators for the sequences of interest. Therefore another line of methods takes a list of single motifs as input along with the sequence data (e.g. Cister [12], ModuleSearcher [7], MScan [13]). These methods can also be used in a de novo setting by first finding candidate motifs using a single motif discovery method, and then running composite motif discovery with the candidate motifs as input. The differences between composite motif discovery methods lie mainly in 1) how single motifs and inter-motif distance conservation are modeled, 2) how motifs are evaluated and ranked, and 3) how the search space of composite motifs is explored. Composite motifs can be modeled in a discrete or probabilistic framework. Discrete methods typically use a set-model for composite motifs, requiring all single motifs to occur in a composite motif instance [14, 15]. This is typically combined with a discrete model for inter-motif distance restriction, the most common approach being a window model that requires all motifs to occur within a sequence window of given length, but without any constraints on internal order or distances between single motifs (e.g. [7, 16]). The discrete approach has several advantages, such as efficient inference and straightforward interpretation of motifs, and often an exhaustive mapping of the search space is possible. However, the reliance on hard thresholds for discretization may pose problems because of uncertainty and variability of TF binding. 
Therefore, the recent trend has been towards probabilistic models of composite motifs. Hidden Markov Models (HMMs) have often been used [10, 12, 17], typically containing different states for each single motif as well as for intra- and inter-module gaps. However, given the advantages of discrete methods we believe that they still can be a useful supplement and alternative to probabilistic methods.

This paper describes a new method Compo, which revisits the discrete approach to composite motif discovery. Compo relaxes the limitation of hard discretization thresholds by using multiple threshold values. By using p-values as a general significance measure, comparison of motifs across threshold settings becomes possible and thus automatic selection of the most interesting motifs across several threshold values. Furthermore, the automatic selection across parameter values means that Compo is able to infer properties of composite motif structure. Compo is therefore able to exploit overrepresentation across co-regulated sequences for improved composite motif detection. Although parameter inference from data is also possible with models such as HMMs, most proposed methods only scan HMMs against target sequence, using fixed parameters for module structure (e.g. [12, 17, 18]). This is basically equivalent to a single-sequence approach. Compo supports a richer composite motif model than previous discrete methods.

In addition to the standard set-model of component motifs, it optionally allows some component motifs to be missing (fault-tolerance) in composite instances, and distance restrictions on composite instances can optionally be enforced. As motif significance is computed as p-values for all supported models, the significance of composite motifs having different structure can easily be compared. An improved background model is also introduced, which combines empirical scanning against real background DNA at the single motif level with model based computations at the composite level. Compo can return either an ordered list of motifs, ranked according to p-values, or a Pareto front (solution set containing solutions not dominated in at least one dimension of objectives) corresponding to a multi-objective evaluation with sensitivity, specificity and spatial clustering as independent objectives. The multi-objective approach gives a collection of resulting motifs displaying more varying characteristics, and it allows necessary trade-offs between objectives to be made while analyzing the results, rather than prior to running the method.

Results

Here we present the Compo algorithm for motif discovery by first introducing the necessary definitions and specifying the relevant problems. We then give the practical implementation of the algorithm. Finally we present the experimental evaluation of the implementation.

Definitions

Let $S = \{S_1, \ldots, S_i, \ldots, S_n\}$ be a set of n symbol sequences, each of which is defined over the alphabet Σ; for DNA sequences $\Sigma = \{A, C, G, T\}$. Let $M = \{M_1, \ldots, M_j, \ldots, M_m\}$ be a set of m motifs of interest. We assume that for each sequence-motif combination there exists a specific function which gives the start positions of all instances of the motif in the sequence; i.e., a function $\Phi : \Sigma^* \times M \to 2^{\{1,2,\ldots,|\Sigma^*|\}}$.

Definition 1 (Motif Support) Given the function Φ, sequence $S_i$ is said to support motif $M_j$, denoted $SS_{S_i}(M_j)$, if $\Phi(S_i, M_j) \neq \emptyset$. Moreover, the support set of $M_j$ is all the sequences in S that support $M_j$; i.e. $SS_S(M_j) = \{S_i \mid S_i \in S \wedge \Phi(S_i, M_j) \neq \emptyset\}$. The absolute support is then the size of $SS_S(M_j)$, i.e. $|SS_S(M_j)|$.

Definition 2 (Module Support) Given the function Φ, sequence $S_i \in S$ is said to support module $Ms \subseteq M$, denoted $SS_{S_i}(Ms)$, iff $\forall M_j \in Ms: \Phi(S_i, M_j) \neq \emptyset$. Moreover, the support set of Ms is all the sequences in S that support Ms; i.e. $SS_S(Ms) = \{S_i \mid S_i \in S \wedge \forall M_j \in Ms: \Phi(S_i, M_j) \neq \emptyset\}$. The absolute support is then the size of $SS_S(Ms)$, i.e. $|SS_S(Ms)|$. Note that $SS_{S_i}(Ms)$ is an indicator variable but $SS_S(Ms)$ is a set of sequences.

Proposition 1 (Monotonicity of module support) Given any S, and any $Mt \subseteq Ms \subseteq M$, then $SS_S(Ms) \subseteq SS_S(Mt)$.
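Definitions 1 and 2 and Proposition 1 can be illustrated with a minimal sketch. The hit table `hits`, the sequence and motif identifiers, and the function name `support_set` are hypothetical and chosen only for illustration; the table plays the role of Φ by mapping a (sequence, motif) pair to a set of start positions.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical stand-in for the function Phi:
# (sequence id, motif id) -> set of 1-based start positions of motif hits.
Hits = Dict[Tuple[str, str], Set[int]]


def support_set(hits: Hits, sequences: Iterable[str], module: Set[str]) -> Set[str]:
    """Return the sequences in which every motif of `module` has at least one hit."""
    return {s for s in sequences if all(hits.get((s, m)) for m in module)}


# Toy data (invented for illustration only).
hits: Hits = {
    ("S1", "A"): {5, 40}, ("S1", "B"): {17},
    ("S2", "A"): {9},
    ("S3", "A"): {3}, ("S3", "B"): {88}, ("S3", "C"): {120},
}
sequences = ["S1", "S2", "S3"]

ss_a = support_set(hits, sequences, {"A"})
ss_ab = support_set(hits, sequences, {"A", "B"})
ss_abc = support_set(hits, sequences, {"A", "B", "C"})

# Monotonicity: adding motifs to a module can only shrink its support set.
print(ss_abc <= ss_ab <= ss_a)
```

The absolute support of a module is then simply `len(support_set(...))`. Note that the empty module is supported by every sequence, which matches the treatment of the root node in the search tree described later.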

Interesting modules are modules supported by many of the sequences in S. This notion is formally defined as follows.

Definition 3 (Frequent Module) For a given support threshold $\sigma \in \{1, 2, \ldots, |S|\}$, module Ms is said to be frequent in S iff $|SS_S(Ms)| \geq \sigma$.

The useful metric of support is a set metric, i.e. defined over the sequence set S. On the other hand, given a single sequence $seq \in \Sigma^*$, it is also relevant to ask how likely it is that a given module has a hit in the sequence. We call the relevant metric module hit-probability, which is formally defined next.

Definition 4 (Module Hit-probability) Given a sequence $seq \in \Sigma^*$, the hit-probability of module $Ms \subseteq M$ is the probability that seq supports Ms. Formally, the module hit-probability of $Ms \subseteq M$ is $\mathrm{Prob}(SS_{seq}(Ms))$.

Proposition 2 (Monotonicity of hit-probability) Given any S, and any $Mt \subseteq Ms \subseteq M$, then $\mathrm{Prob}(SS_{seq}(Ms)) \leq \mathrm{Prob}(SS_{seq}(Mt))$.

The hit-probability can in principle be defined over arbitrary sequences. In this work, we are particularly interested in representative background sequences or sequences generated from the background model BM. For this reason, lower hit-probabilities correspond to higher specificity (divergence from the background).

Definition 5 (Specific Module) Given a representative background sequence bgseq ~ BM, and specificity threshold ψ, module $Ms \subseteq M$ is called a specific module iff $\mathrm{Prob}(SS_{bgseq}(Ms)) \leq \psi$.

In addition to support and hit-probability, an important metric is the statistical significance (significance for short) defined below. Significance is interpreted as how improbable the observed support is in a corresponding set of background sequences.

Definition 6 (Module Significance) Given S and BM, the significance of module $Ms \subseteq M$ is the probability of having support of at least $|SS_S(Ms)|$ in a background sequence set BS which is generated from BM and structurally equivalent to S, i.e. $|S| = |BS|$ and $\forall i \in \{1, 2, \ldots, |S|\}: (S_i \in S) \wedge (BS_i \in BS) \Rightarrow (|S_i| = |BS_i|)$. Formally, the module significance of $Ms \subseteq M$ is $\mathrm{Prob}(|SS_{BS}(Ms)| \geq |SS_S(Ms)|)$.

Definition 7 (Significant Module) For a given significance threshold $\theta \in [0, 1]$, module $Ms \subseteq M$ is significant if $\mathrm{Prob}(|SS_{BS}(Ms)| \geq |SS_S(Ms)|) \leq \theta$.

Problem specification

We consider three basic problem specifications (Problems 1, 2 and 3) within the setting presented above.

Problem 1 (Frequent and Specific Modules) For fixed S, BM, M, and given support threshold σ and specificity threshold ψ, find all modules $Ms \subseteq M$ which are frequent and specific.

Problem 1 is very similar to well established itemset and sequential itemset mining problems [19]. In these problems, the solution space typically grows very large and many solutions are usually not interesting. Therefore users are allowed to define their interest by specifying constraints. The user defined constraints are enforced by the mining system in order to focus the search on the interesting solutions only [20]. Moreover, certain classes of constraints (e.g. monotonicity) make the search efficient: this is done by pushing the constraints inside the mining process.

What is common to itemset and sequential itemset mining approaches is the generation of complete solutions; i.e. every solution (frequent and specific modules) satisfies the user-specified threshold parameters and constraints. On the other hand, in motif discovery problems, incomplete solutions employing heuristic searches are usually preferred. These solutions are supposed to optimize some well-defined optimality criterion (e.g. support or hit-probability). However, there is usually more than one optimality criterion, thus making the problem a multi-objective optimization problem [21]. There are basically two different ways to approach this. One possibility is to define a scheme for combining the different optimality criteria into a single criterion, score every motif according to this combined criterion, and return a list of motifs ranked according to score. The scheme for combining criteria may be ad hoc, or it may for instance be based on an unexpectedness scheme with ranking of p-values as described in Motif scoring. Ranking according to a single criterion is easy for a user to relate to. It is thus advantageous for novice users, when several data sets are analyzed rapidly, or when an objective criterion for selection is needed, such as with automatic benchmarks. We define the combined-objective approach to solution space as follows:

Problem 2 (Top-ranking Modules) Given the motif set M, module size c, a desired number n of composite motifs to be returned, and a score function f mapping composite motifs to scalar score values, find the n top-ranking modules according to the score function, i.e. $Ms \subseteq M$ s.t. $|Ms| = c$ and $f(Ms) \geq f(Mt)$ for any non-returned motif Mt.

The other possibility is to fully treat motif discovery as a multi-objective optimization problem with each objective representing a separate dimension of optimality. One can then return the Pareto front of composite motifs. The Pareto front contains all non-dominated motifs, where dominated means that there exists another motif with equal or better score values for all objectives. As this selects motifs that score high in different dimensions of optimality, it may give a more varied collection of output motifs. For in-depth analysis of a data set this may give a richer picture of potential regulators. We define the multi-objective approach to solution space as follows:

Problem 3 (Pareto-optimal Modules) Given the motif set M and module size c, find the Pareto front of M, i.e. $Ms \subseteq M$ s.t. $|Ms| = c$ and Ms is non-dominated in the specified dimensions.

The definition given in Problem 3 is very general in the sense that any number of dimensions can be incorporated. For instance, support and hit-probability can be selected as dimensions.
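As a minimal sketch of Problem 3 with support and hit-probability as the two dimensions, the following filters a list of candidate modules down to its non-dominated members. The names `Candidate`, `dominates` and `pareto_front` and the toy values are hypothetical. The sketch also assumes the common convention that a dominating motif must be strictly better in at least one objective, so that motifs with identical values do not eliminate each other; this detail is an assumption and is not spelled out in the text.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    support: int     # number of supporting sequences; higher is better
    hit_prob: float  # background hit-probability; lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    no_worse = a.support >= b.support and a.hit_prob <= b.hit_prob
    better = a.support > b.support or a.hit_prob < b.hit_prob
    return no_worse and better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates (quadratic scan)."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Toy values, invented for illustration only.
candidates = [
    Candidate("{A,B}", 8, 0.20),
    Candidate("{A,C}", 6, 0.05),
    Candidate("{B,C}", 5, 0.10),  # dominated by {A,C}
    Candidate("{A,D}", 8, 0.30),  # dominated by {A,B}
    Candidate("{C,D}", 3, 0.01),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

The quadratic scan is adequate for small candidate lists; further dimensions can be added by extending `Candidate` and the two comparisons in `dominates`.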

Given the dimensions of interest, the input sequence set and the background model, a straightforward complete solution to Problem 2 or Problem 3 can be obtained as follows.

Generate every $Ms \subseteq M$ s.t. $|Ms| = c$ and output any motif satisfying the criteria in Problem 2 or 3.

The number of subsets of M can grow exponentially, for instance when c ≈ |M|/2, thus making the straightforward approach infeasible when |M| is large. Fortunately, though the motif set M can be large (i.e. hundreds of motifs, e.g. the full TRANSFAC database), most biological modules comprise at most a few individual motifs. So, by bounding c with a relatively small constant (e.g. 4 or 5), the straightforward approach becomes feasible, as the number of such subsets grows polynomially in |M|. This observation allows us to exhaustively consider only modules with up to a few constituent motifs. Even so, the straightforward approach may become impractical when |M| is large, even though c is bounded by a small constant. As a realistic approach for solving Problem 2 or 3 efficiently we propose the Compo algorithm as described in the next section. The main advances are in exploiting monotonicities and using heuristics and approximations for efficient module discovery. This enables Compo to cope with large |M| (on the order of hundreds).
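The growth argument above can be checked with a few lines of arithmetic. The helper name `n_modules_up_to` and the library size of 300 motifs are illustrative assumptions, not values from the paper.

```python
from math import comb


def n_modules_up_to(m: int, c: int) -> int:
    """Number of non-empty motif subsets with at most c members, out of m motifs."""
    return sum(comb(m, k) for k in range(1, c + 1))


m = 300  # hypothetical library size, "hundreds of motifs"
for c in (2, 3, 4, 5):
    print(c, n_modules_up_to(m, c))

# For comparison: the number of subsets of size m/2, which is astronomically larger.
print(comb(m, m // 2))
```

Bounding c keeps the count polynomial in |M|, but the numbers still grow quickly with c, which is why pruning remains useful for large motif libraries.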

The Compo algorithm

This section gives a general overview of the Compo algorithm. Details on each step of the algorithm are given under relevant subsections of Implementation, as indicated below.

The general workflow of Compo is shown schematically in Figure 1. A set S of regulatory regions is retrieved from a sequence database, and a set M of regulatory motifs is retrieved from a motif database or discovered de novo by any external method. The hit positions of all motifs $M_j \in M$ in every sequence $S_i \in S$ are then found (Pre-processing of input). Composite motifs are enumerated in an implicit search tree. For each enumerated composite motif node, the support and hit-probability are calculated. Support is the number of sequences with a module hit; hit-probability is the (approximated) probability of having at least one module hit in a background sequence. For each node in the tree, these values are calculated from the values at the parent node and the values of the added single motif (Enumeration of composite motifs). Compo supports two alternative forms of output: a list of motifs ranked according to a combined significance measure (Motif scoring), or a Pareto front of optimal motifs according to a multi-objective optimization (Pareto front). Compo can optionally allow non-perfect matches (Allowing non-perfect matches) and enforce distance constraints (Incorporating distance constraints). Finally, techniques used to make Compo as efficient as possible are briefly discussed (Computational efficiency).

Figure 1

Compo workflow. The general workflow of Compo, from a list of genes defining regulatory regions of interest, to a Pareto front or ranked list of composite motifs as potential regulators of the genes.

Implementation

Pre-processing of input

The first step of the analysis is determination of motif hits, i.e. the function Φ(S i , M j ). As Compo operates on discretized motif hits, and thus works independently of the internal representation of single motifs, any external de novo motif discovery method or motif library can be used for this first step. If probabilistic motifs (e.g. PWMs) are to be used with Compo, the continuous match values at each position have to be discretized into hits and no-hits. This can be done by setting a hit threshold for each motif. Hit thresholds can be calculated algebraically or determined based on the resulting distribution of hits in input sequences and background sequences.

When the match values of motifs have a clear probabilistic interpretation, such as log-likelihoods or log-odds, it can be meaningful to simply set all hit thresholds to a universal, analytically reasoned value. Similarly, as p-values can be computed from match scores, hit-thresholds may be found that correspond to a specific p-value of motif match according to a stochastic sequence model. Alternatively, hit-thresholds may be set to control some property of the resulting hit-distribution, for example to achieve a specific frequency of hits in the input or background sequences. In general, any function can be defined on the number of hits in input and background sequences respectively, and the hit-threshold set to the value that optimizes this function.

In the current implementation we calculate a desired number of hits as the number of input sequences multiplied by a hit density factor, and then set the hit-threshold of each motif to the value that achieves this desired number of hits across the input sequences. For the initial step of obtaining continuous match values of motifs against sequences, we make use of the TAMO motif tools [22]. In the default setting, several values are tried for the hit density factor and the most significant motifs across density factor values are returned.
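The density-factor rule described above can be sketched as follows. This is not the TAMO-based implementation; the function names `hit_threshold` and `discretize`, the score layout (one list of per-position match scores per sequence) and the toy scores are assumptions made for illustration.

```python
from typing import List, Sequence, Set


def hit_threshold(scores_per_sequence: Sequence[Sequence[float]],
                  density_factor: float) -> float:
    """Pick the score threshold giving about n_sequences * density_factor hits in total."""
    all_scores = sorted((s for seq in scores_per_sequence for s in seq), reverse=True)
    if not all_scores:
        raise ValueError("no match scores given")
    desired = round(len(scores_per_sequence) * density_factor)
    desired = min(max(desired, 1), len(all_scores))
    # The desired-th best score becomes the threshold; tied scores may add extra hits.
    return all_scores[desired - 1]


def discretize(scores_per_sequence: Sequence[Sequence[float]],
               threshold: float) -> List[Set[int]]:
    """Return, per sequence, the 1-based positions whose score reaches the threshold."""
    return [{pos for pos, s in enumerate(seq, start=1) if s >= threshold}
            for seq in scores_per_sequence]


# Toy match scores for one motif against three sequences (invented values).
scores = [[0.1, 2.5, 0.7], [1.9, 0.2], [3.1, 0.4, 1.2]]
thr = hit_threshold(scores, density_factor=1.0)
hit_positions = discretize(scores, thr)
print(thr, hit_positions)
```

Trying several density factors, as in the default setting, would amount to repeating this step per factor and running the downstream search on each resulting set of discretized hits.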

Enumeration of composite motifs

Combinations of single motifs are conceptually explored exhaustively in a search tree as shown in Figure 2. Each node (except the root) is associated with a single motif and each path from the root to a leaf node corresponds to a unique combination of single motifs (i.e. a composite motif). The number of levels in the search tree is constrained by a maximum number $|Ms| = c$ of motif components, given as a parameter to the algorithm. A search tree of c levels encompasses all combinations of up to c motifs. Each leaf or non-leaf node z, with the respective single motif $M_z \in M$, has two basic variables associated with it: the support set $H_z = SS_S(M_z)$ and the hit-probability $P_z = \mathrm{Prob}(SS_{bgseq}(M_z))$. The values of $H_z$ and $P_z$ are pre-computed and used whenever needed. Additionally, each node has two other variables, $H_{X.z}$ and $P_{X.z}$, for incrementally updating the partial module support and hit-probability, respectively. These values represent the support and hit-probability of the composite motif represented by the path from the root to node z. The $H_{X.z}$ and $P_{X.z}$ values for node z are calculated from the accumulated values of the parent node and the $H_z$ and $P_z$ values, respectively.

Figure 2

Search tree. Implicit search tree, where numbers inside nodes correspond to single motifs (z), and paths from the root to a node correspond to composite motifs. The values $H_{\{1,3\}}$ and $P_{\{1,3\}}$ correspond to the path in bold. The X symbol indicates that some composite motifs will be pruned during search.

The support set $H_{X.z}$ of module X.z, where X is the set of single motifs down to node z, can be computed from module support $H_X$ and single motif support $H_z$ by just intersecting the sets $H_X$ and $H_z$. Formally, $H_{X.z} = H_X \cap H_z$. The hit-probability $P_{X.z}$ can similarly be computed as $P_{X.z} = P_X \cdot P_z$. The root node of the search tree is an empty module, and as there are no single motifs that require a match, $P_{root}$ is trivially 1, and $H_{root}$ is the set of all input sequences. Values for the nodes in the tree are then calculated incrementally down the tree in a depth-first order. This model was also considered in a previous paper [21].
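The incremental computation along the search tree can be sketched as a depth-first generator. The function name `enumerate_modules`, its parameters and the toy data are hypothetical. The `min_support` pruning rule is an assumption: it relies on the monotonicity of support (Proposition 1) and is only one plausible reading of the pruning indicated in Figure 2.

```python
from typing import Dict, FrozenSet, Iterable, Iterator, Tuple

Module = Tuple[str, ...]


def enumerate_modules(H: Dict[str, FrozenSet[str]], P: Dict[str, float],
                      all_seqs: Iterable[str], c: int, min_support: int = 1
                      ) -> Iterator[Tuple[Module, FrozenSet[str], float]]:
    """Yield (module, support set, hit-probability) for modules of up to c motifs."""
    motifs = sorted(H)  # fixed order, so each combination is visited exactly once

    def dfs(start: int, X: Module, HX: FrozenSet[str], PX: float):
        for i in range(start, len(motifs)):
            z = motifs[i]
            HXz = HX & H[z]        # H_{X.z} = H_X intersected with H_z
            if len(HXz) < min_support:
                continue           # assumed pruning: supersets cannot regain support
            PXz = PX * P[z]        # P_{X.z} = P_X * P_z
            Xz = X + (z,)
            yield Xz, HXz, PXz
            if len(Xz) < c:
                yield from dfs(i + 1, Xz, HXz, PXz)

    # Root: empty module, supported by all sequences, hit-probability 1.
    yield from dfs(0, (), frozenset(all_seqs), 1.0)


# Toy pre-computed single-motif values (invented for illustration).
H = {"A": frozenset({"S1", "S2", "S3"}),
     "B": frozenset({"S1", "S3"}),
     "C": frozenset({"S3"})}
P = {"A": 0.5, "B": 0.2, "C": 0.1}

results = {X: (HX, PX)
           for X, HX, PX in enumerate_modules(H, P, ["S1", "S2", "S3"], c=2, min_support=2)}
for X, (HX, PX) in results.items():
    print(X, sorted(HX), PX)
```

Because each child only needs the parent's accumulated set and probability, no module is ever recomputed from scratch, and a pruned node removes its entire subtree from the search.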

Motif scoring

Compo can assign a score to each candidate composite motif and return a ranked list of composite motifs as output. This requires that several desirable characteristics, such as high support and low probability of hit in background, are combined into a single score value. We use an approximated p-value of observed composite motif support as our score measure. The generality of the p-value as a measure allows composite motifs with differing characteristics to be directly compared.

The significance of a composite motif, i.e. the approximated p-value of observed support, is computed by the following four steps:

  1. Position-level probability: The probability that a single motif occurs at a specific location in a background sequence. This is estimated as the frequency of motif hits in real DNA sequences serving as background.

  2. Sequence-level probability: The probability that a single motif occurs at least once in a sequence of given length. This is computed as the probability of the union of the events of occurring at each location. As an approximation, the match probabilities are assumed to be equal and independent at all locations, ignoring auto-correlation. This gives the formula $1 - (1 - p_{pos})^l$, where l is the average length of the sequences and $p_{pos}$ is the probability of a motif hit at a single position under the background model BM.

  3. Hit-probability (Composite motif-level probability): The probability that a composite motif occurs in a sequence of given length. This is computed as the product of the sequence-level probabilities of each motif component.

  4. Significance p-value (Dataset-level probability): The probability of seeing at least the observed support in a corresponding set of background sequences. This is computed as the right tail of a binomial distribution, i.e. as the probability of obtaining at least k out of n successes with Bernoulli trial probability p. Here, p is the composite motif-level probability, n is the number of input sequences, and k is the support of the composite motif.
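The four steps above can be sketched in a few functions. The function names and the example values are hypothetical, and the position-level probabilities are assumed to have been estimated beforehand by scanning real background DNA (step 1), which is not reproduced here.

```python
from math import comb, prod
from typing import Sequence


def sequence_level_prob(p_pos: float, length: int) -> float:
    """Step 2: probability of at least one hit in a sequence of the given length."""
    return 1.0 - (1.0 - p_pos) ** length


def hit_probability(seq_probs: Sequence[float]) -> float:
    """Step 3: all components must occur; independence between motifs is assumed."""
    return prod(seq_probs)


def binomial_tail(k: int, n: int, p: float) -> float:
    """Step 4: P(at least k successes in n Bernoulli trials with success probability p)."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def module_p_value(p_pos_per_motif: Sequence[float], avg_len: int,
                   n_sequences: int, support: int) -> float:
    """Combine steps 2 to 4 for one composite motif."""
    seq_probs = [sequence_level_prob(p, avg_len) for p in p_pos_per_motif]
    return binomial_tail(support, n_sequences, hit_probability(seq_probs))


# Hypothetical example: two component motifs, 10 input sequences of average
# length 500, and the composite motif is supported by 6 of them.
p_value = module_p_value([0.001, 0.002], avg_len=500, n_sequences=10, support=6)
print(p_value)
```

The direct summation in `binomial_tail` is adequate for the modest n of typical input sets. For large n or extremely small tails, a library routine such as `scipy.stats.binom.sf(k - 1, n, p)` would be a reasonable alternative.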

This scoring procedure is a mix of model-based (algebraic) and empirical evaluation of significance. A purely empirical evaluation would compute a p-value directly in step 4 by comparing observed support with support in several different background sets of sequences. Conversely, a purely algebraic evaluation would compute match probabilities in step 1 algebraically from probabilities at each motif position according to a simplified DNA model.

The empirical and algebraic approaches each have their strengths and weaknesses. As the motif score is used to contrast potential binding sites against surrounding DNA sequence, having a background that is as realistic as possible is desirable. DNA sequences have several properties that depart from random sequence models, and using the frequency of hits in real genomic sequence may thus capture the background more accurately. On the other hand, estimates based on empirical frequencies become inaccurate when the frequency is low, and are limited by the minimum frequency. As the p-values in step 4 of the scoring procedure are often extremely low, the observed support would have to be compared against a huge collection of background sequence sets. Also at the third step the probability values are often very low when there are many motif components.

The mixed solution we have chosen combines advantages of both approaches. As the position-level probability is estimated from hits in real DNA, we have avoided assuming a simplified model of genomic background sequences at the local level. Using algebraic computations at steps 2 and 3 instead assumes a random (simplified) model of the spatial distribution of motif occurrences. It is of course possible that motifs are unevenly distributed even in the background model of non-modules, but we consider this assumption less problematic. As the values at steps 3 and 4 are computed algebraically, they are not limited by the lowest possible empirical frequency. Also, efficient algebraic formulas are used for computing values at steps 2, 3 and 4 in the large search space of composite motifs, while the computationally demanding process of scanning against real negative data in step 1 is only performed once for each single motif, in the initial phase of the analysis.

Our calculations in steps 2 and 3 are based on simple and approximate formulas which ignore correlations. The main motivation for this approach is the efficiency of the simple and incremental calculation in the search tree, as described in Enumeration of composite motifs. Similar tradeoffs for increased efficiency are inherent in most motif discovery methods [23] due to the difficulty of the problem.

Pareto front

As an alternative to motif ranking based on a combined score, Compo also supports motif discovery as a multi-objective optimization problem. The composite motif-level probability described in Motif scoring then constitutes an independent final objective. Composite motif support and enforced distance restriction form additional separate objectives, and a Pareto front of motifs is returned as described in Problem specification.

Intuitively, the Pareto front contains motifs that have at least one or a few very good characteristics. This may make the motif discovery process more informative, as the returned motifs typically represent a broader view of the composite motif space (i.e. motifs with more varied characteristics) compared to the same number of motifs from a list ranked according to a single combined objective. The Pareto front can be visualized as an n-dimensional heat map, allowing the user to get an overview of trends in the results. After the search is finished, the user may decide on how to balance different criteria against each other and inspect motifs with desired combination of properties.

Allowing non-perfect matches

It is in some cases biologically relevant to allow for occasional absence of individual TF binding sites in module instances. This may be a desirable feature even if we assume that the module always contains the full set of binding sites, as it makes the approach more robust against inaccuracies in the single motif scanning step.

In this case we need to know the number of allowed motif mismatches in order to determine the hit-probability and the support set. We say that a composite motif is defined by its component motifs, and refer to the different possibilities of allowed number of lacking motif matches as variants of the composite motif. A variant allowing q mismatches for a composite module consisting of a set X of single motifs is denoted as $V_X^q$. When we want to isolate a particular component motif we write $V_{X.y}^q$, where y is the single motif of particular interest and X is the set of remaining single motifs. In order to compute values incrementally from the already computed values of the parent and the newly added single motif, we need to keep values for different numbers of allowed motif misses. More specifically, a variant $V_{X.y}^q$ generally uses the pre-computed values of two variants of the parent composite motif, $V_X^q$ and $V_X^{q-1}$, as well as the values of the additional motif y. In this case the support set and hit-probability are computed incrementally as follows, where q refers to the number of allowed mismatches.

$$H_{X.y}^{q} = H_{X}^{q-1} \cup \left( H_{X}^{q} \cap H_{y}^{0} \right)$$

As hits for the new motif y are assumed independent from hits for the motifs X in the background, and since $H_{X}^{q-1}$ is a subset of $H_{X}^{q}$, the formula for hit-probability becomes (detailed derivation given in Additional file 1):

$$P_{X.y}^{q} = P_{X}^{q-1} + \left( P_{X}^{q} - P_{X}^{q-1} \right) \cdot P_{y}^{0}$$

In the case when the number of allowed mismatches is equal to or greater than the number of components, H is trivially the set of all sequences and S is 1. For each composite motif, values are computed for variants with 0...q mismatches (as long as q is not greater than the number of components).
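The incremental update of support set and hit-probability can be illustrated with a small Python sketch. This is not taken from the Compo source code: the function name `extend_variants`, the list-based layout of the variants and the toy numbers are assumptions made for illustration. The base case q = 0 is also an assumption, namely that the variant with q - 1 mismatches is treated as empty, so that both the parent and the new motif must hit.

```python
def extend_variants(parent_support, parent_prob, y_support, y_prob):
    """Return (support, prob) lists for the variants of X.y with q = 0, 1, ... misses.

    parent_support[q] and parent_prob[q] hold H_X^q and P_X^q for the parent X;
    y_support and y_prob hold H_y^0 and P_y^0 for the added single motif y.
    """
    support, prob = [], []
    for q in range(len(parent_support)):
        if q == 0:
            # Assumed base case: with no misses allowed, X and y must both hit.
            support.append(parent_support[0] & y_support)
            prob.append(parent_prob[0] * y_prob)
        else:
            # Either X already matches with q-1 misses (y may be the extra miss),
            # or X matches with q misses and y itself has a hit.
            support.append(parent_support[q - 1] | (parent_support[q] & y_support))
            prob.append(parent_prob[q - 1]
                        + (parent_prob[q] - parent_prob[q - 1]) * y_prob)
    return support, prob


# Toy data (made up): five sequences and a parent X with a single component x.
all_seqs = {1, 2, 3, 4, 5}
h_x, p_x = {1, 2, 3}, 0.2
h_y, p_y = {2, 3, 4}, 0.5
# For a one-component parent, q = 1 is the trivial variant:
# all sequences are in the support set and the hit-probability is 1.
support, prob = extend_variants([h_x, all_seqs], [p_x, 1.0], h_y, p_y)
```

With these toy numbers, the variant with q = 1 is supported by sequences 1 to 4 and has hit-probability 0.2 + 0.8 · 0.5 = 0.6, which is the probability that at least one of two independent motifs has a hit.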

Incorporating distance constraints

As co-acting TFs may be expected to be in physical proximity, their binding sites are often clustered in sequence space. This is not a strict requirement, as the DNA strand may form loops between distant sites. However, limits on motif distances can still be a reasonable assumption, in particular when combined with flexibility regarding single motif mismatches, which increases the general robustness of motif discovery. Compo supports constraints on the distance between motifs by requiring component motifs to have hits within a sequence window of a specific length.

As each component motif may have several instances in a given sequence, it is not entirely trivial to check distance constraints. One possibility is to slide a window through a sorted list of all motif occurrences and check whether the sliding window at any point contains occurrences of all components. A second possibility is to enumerate all combinations of occurrences from each motif component and check whether all occurrences of any combination are within the window. We have as default chosen the second method, as it allows an intuitive recursive implementation and can easily be combined with options such as non-overlapping motifs and single motif mismatches.
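The second, recursive approach can be sketched as follows. This is a simplified illustration and not the Compo implementation: the function name `within_window` is hypothetical, hits are represented by start positions only (motif lengths and overlap handling are ignored), and the window test is assumed to be that the largest and smallest chosen positions differ by at most the window length.

```python
def within_window(occurrences, window, allowed_misses=0):
    """Return True if one hit per component can be chosen so that all chosen
    positions lie within `window` bases of each other.

    occurrences: one list of hit start positions per component motif.
    allowed_misses: number of components that may be left out entirely.
    """
    def recurse(i, lo, hi, misses_left):
        if i == len(occurrences):
            return True  # every component has been placed or skipped
        for pos in occurrences[i]:
            new_lo = pos if lo is None else min(lo, pos)
            new_hi = pos if hi is None else max(hi, pos)
            # Only descend while the partial combination still fits the window.
            if new_hi - new_lo <= window and recurse(i + 1, new_lo, new_hi, misses_left):
                return True
        # Optionally skip this component (a single motif mismatch).
        return misses_left > 0 and recurse(i + 1, lo, hi, misses_left - 1)

    return recurse(0, None, None, allowed_misses)


# Toy hit positions (made up) for three component motifs in one sequence.
hits = [[10, 500], [480, 900], [520]]
fits_50 = within_window(hits, 50)          # 500, 480 and 520 span 40 bases
fits_30 = within_window(hits, 30)          # no combination spans 30 bases or less
fits_30_one_miss = within_window(hits, 30, allowed_misses=1)  # 500 and 480 suffice
```

The recursion abandons a partial combination as soon as it no longer fits the window, and the skip step shows how single motif mismatches combine naturally with this formulation.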

Distance restrictions are also taken into consideration in motif scoring. Hit-probability is then the probability that the composite motif occurs within a distance window in a background sequence. As the composite motif may occur in any of the (overlapping) windows of the sequence, hit-probability is computed by combining the probability of occurring in the first window of the sequence, and the probabilities of occurring in any of the remaining windows given that it did not occur in the preceding window. The details on how distance constraints are enforced when computing the support set of a composite motif, and how hit-probability is computed, are given in Additional file 1.

Computational efficiency

The running time of Compo is mainly determined by the number of input sequences |S|, the number of input single motifs |M|, and the maximum number of motif components c considered. Several techniques are employed to increase the computational efficiency. While exploring the search space, motif values are computed incrementally down the tree from parent values and pre-computed active node values, instead of being computed from the ground up each time. As the support set $H_{X}$ is computed as a set intersection, and the incremental computation of hit-probability is done algebraically, values at each node are computed with little computational effort. Furthermore, if there are many input sequences, the computation of the support set can be done very efficiently using bit strings. A branch-and-bound approach is used to prune the search tree. For each node visited in the tree, a bound on the highest achievable score for any node in the subtree is computed and compared against the Pareto front or ranked motif list discovered so far. If the bound is dominated by the current Pareto front or ranked list, the whole subtree is discarded from the search space. For large runs, more than 99.9% of the search tree is typically discarded this way. Details of the branch-and-bound approach are given in a previous publication [21] and in Additional file 1.
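As an illustration of the bit string idea, the sketch below encodes each support set as a Python integer in which bit i is set when sequence i is in the set. The helper names `to_bits` and `support_size` are hypothetical and the data are made up; the sketch only shows the general technique, not how Compo stores its support sets.

```python
def to_bits(seq_indices):
    """Encode a set of sequence indices as an integer used as a bit string."""
    bits = 0
    for i in seq_indices:
        bits |= 1 << i
    return bits


def support_size(bits):
    """Number of sequences in the support set (count of set bits)."""
    return bin(bits).count("1")


# Toy support sets (made up) for two single motifs over six sequences.
h_x = to_bits({0, 1, 2, 5})
h_y = to_bits({1, 2, 3, 5})

# Support of the composite motif with no mismatches: one bitwise AND.
h_xy = h_x & h_y
# Support when one mismatch is allowed among the two components: one bitwise OR.
h_xy_one_miss = h_x | h_y
```

Intersections and unions of support sets then reduce to single bitwise operations, independent of how many sequences are actually in each set.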

Testing

Compo was tested on a large benchmark suite [24] compiled from the TransCompel database (v9.4) [25], in addition to two smaller suites compiled from muscle- [26] and liver-specific [27] genes, and a recent suite compiled from the REDfly database [28]. It was run with automatic parameter selection, meaning that for each data set Compo automatically selected parameter values from a list of discrete possibilities. Although the performance of Compo could have been further improved by manually specifying optimal parameter values, this could easily have caused overtuning and was therefore avoided.

In the main benchmark suite (compiled from the TransCompel database), target PWMs are mixed with randomly selected TRANSFAC PWMs that have no annotated binding sites in a given data set. These PWMs without annotated binding sites are referred to as noise PWMs, and are introduced to simulate a situation without accurate knowledge of the true regulators. The benchmark suite defines 6 different noise levels, where the percentage of noise PWMs varies between 0% and 99%. The highest noise level, denoted as 99%, uses the whole of TRANSFAC as input and thus actually has around 99.7% noise PWMs.

At each noise level, ten different data set versions are defined, corresponding to different random selections of noise PWMs. This benchmark thus defines a total of 600 runs on individual data sets, with each data set consisting of between 5 and 16 input sequences. The results on this benchmark are shown in Table 1. Compo outperforms all other methods on all noise levels of the benchmark.

Table 1 Prediction performance

Given the good performance of Compo we further investigated how the performance was influenced by relevant unique features of Compo, in particular the background based on real DNA sequence and the possibility of inferring motif properties across co-regulated sequences. The partly empirical background computations are unique to Compo, while the possibility of inferring motif properties is shared with CMA and ModuleSearcher. Table 2 compares the default score of Compo with scores achieved when using only a random model of DNA (computations according to a multinomial sequence model instead of real background DNA) and when considering each sequence in isolation (and not support across several sequences). It seems that both the empirical background and the inference of composite motif properties across co-regulatory regions contribute strongly to the high performance of Compo. When either of these elements is removed, the performance of Compo drops to a level comparable to other methods on the TransCompel suite.

Table 2 Influence of background models and support

On the muscle and liver benchmarks the performance of Compo is equal to or better than most other methods, except for MSCAN on muscle data and Cluster-Buster on liver data (see Table 3). The benefit from support is less obvious here when judged by the nCC score. However, using support tends to give more conservative solutions with fewer false positives compared to independent sequence runs (data not shown). This benchmark also shows the effect of allowing non-perfect matches. The effect is most pronounced in the muscle data set, where the relevant binding site motifs (Mef2, Myf, Sp1, SRF and TEF) on average are found in only 42% of the modules, compared to 57% for the liver data set and motifs (HNF-1, HNF-3, HNF-4 and CEBP).

Table 3 Results on muscle and liver data sets

The benchmarks discussed above each have their strengths and limitations. The TransCompel benchmark is broad and robust, with 10 data sets, different levels of noise, and a total of 600 runs. However, as TransCompel currently contains almost exclusively TFBS pairs, methods are only tested on the discovery of small composite motifs. The muscle and liver benchmark data sets have larger composite motifs, but with only 2 data sets and a total of 2 runs, the results are less robust. An interesting addition to these two benchmarks is presented in a recent article by Ivan et al. [28]. A total of 33 data sets were compiled based on data from the REDfly database [29]. The data sets from this benchmark have been made available, together with a relatively simple evaluation procedure. Performance data according to this evaluation procedure have also been made available for a few selected methods. The accompanying evaluation procedure requires exactly one composite motif instance to be predicted for each sequence, requires all predicted instances for a given data set to have equal length, and only evaluates the predictions of start locations, not length predictions of composite motif instances. Based on this, the sensitivity of predictions is calculated for each data set, along with a p-value for whether predictions are significantly better than random. The main performance measure is the number of data sets with significant predictions (at the 0.05 level).

We evaluated Compo on this benchmark according to the accompanying evaluation procedure, which assumes a CRM length of 750 bp for all data sets. Results are given in Table 4. Compo made significant predictions (at the 0.05 level) on 9 out of the 33 data sets. This is better than random and better than the methods CisModule (4) and MCD (4), similar to D2z (9) and Stubb (10), and lower than CSAM (14), the best-performing method, which accompanied the benchmark.

Table 4 Results on Drosophila data sets

Further details on the experimental setup are given in Additional file 1.

Pareto front

Compo may optionally return a Pareto front corresponding to a multi-objective evaluation on sensitivity, specificity and spatial clustering. Intuitively, the Pareto front contains motifs that have good values for at least one or a few of these characteristics. This gives a broader view of possibly interesting motifs, and leaves the final selection of output motifs to a subjective evaluation by the user. In addition to giving a broad view, this also avoids combining different objectives by general formulas that are typically inferior to expert judgment. This defining property of multi-objective optimization, however, also means it is not suited for use in automatic benchmarking procedures. For this reason, we used a standard ranking of motifs according to a combined score in the benchmarks.
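The notion of a Pareto front can be made concrete with a short sketch. Here each motif is described by a tuple of objectives in which larger values are better, so the distance window and the hit-probability are negated. The function names and the toy values are assumptions for illustration and do not reflect Compo's internal representation.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one (larger values are better)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Keep the candidates that are not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Toy motifs (made up) as (support, -distance window, -hit-probability).
motifs = [
    (5, -200, -0.01),   # highest support
    (4, -200, -0.001),  # most specific
    (3, -50, -0.02),    # most tightly clustered
    (3, -200, -0.02),   # worse than the first motif on support and specificity
]
front = pareto_front(motifs)
```

In this toy example the first three motifs each excel on one objective and stay on the front, while the last one is dominated by the first and is dropped.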

To give an example of properties of Pareto fronts for composite motifs, we show the Pareto front for one of the data sets of the TransCompel benchmark presented above. On this data set, the highest-ranked motif predicted by Compo was not accurate. The top-ranking composite motif was composed of PWMs related to the Ets and GATA TFs, while the annotations for the data set specified a composite element composed of an AP1-related and an NFAT-related PWM. Figure 3a shows a heat map of the Pareto front for this data set, with support as first dimension, distance restriction as second dimension and specificity as third dimension (color). An interesting composite motif should typically have high support, be closely spaced, and be specific with respect to background. To the upper left are very specific (red) composite motifs with low support and low spatial clustering, while at the lower right are less specific (blue) composite motifs with high support and high spatial clustering. Expert users may then make subjective judgements regarding trade-offs between these characteristics and further inspect composite motifs of interest.

Figure 3

Pareto front. a) Pareto front of optimal composite motifs corresponding to a multi-objective optimization with respect to support, distance restriction and specificity (hit-probability). Red colors show high specificity and blue colors show low specificity. b) Corresponding layout of motifs where colors instead denote combined motif score (significance). Red colors correspond to highest-ranked motifs according to combined score. The top-ranked composite motif is located at support 5 and distance window 200.

Figure 3b shows a corresponding layout of the composite motifs on the x- and y-axes, but with the z-axis (color) representing the score of the composite motifs according to our combined score measure (p-values). This heat map shows that a composite motif with support 5 and distance window 200 has the highest combined motif score (the composite motif composed of an Ets- and a GATA-PWM mentioned above). Some of the alternative composite motifs are composed of the true annotated TFs for this data set. Composite motifs with NFAT as component are marked with an X in the figure, while O denotes AP1. A composite motif with support 4 and a distance window of 200 is composed of both AP1 and NFAT. Although the highest-ranked composite motif was not related to any annotated TF, the Pareto front contains other composite motifs with better spatial clustering (support 3, distance window 50) or better specificity, denoted by orange color (at support 4, distance window 200), that contain one or both of the annotated TFs.

Discussion

Given a set of genes (believed to be co-regulated), the objective of composite motif discovery methods is to predict the transcription factors that are the underlying regulators of the gene set. The starting point would be the gene list together with known motifs for individual factors, available from databases such as TRANSFAC [25] and Jaspar [30]. Alternatively, de novo single motif discovery may be performed to discover overrepresented short contiguous motifs in the sequences.

Given upstream gene regions and either known or de novo motifs, composite motif discovery methods such as Compo may be used to discover enriched combinations of motifs, which may correspond to cis-regulatory modules. With CC scores ranging from 0.35 to 0.52 on the TransCompel benchmark, Compo is consistently able to give useful computational binding site predictions for sets of co-regulated genes even when the true regulators are not known.

Users may have different levels of prior knowledge about the composite motifs they are seeking when they resort to a computational method. Some users may, for example, know the exact composition of the relevant module, whether all TFs are obligatory for the function of the composite motifs, and what the typical distances between binding sites in a module are. Other users may know nothing more than a list of TFs potentially regulating a list of genes. A computational method should therefore allow such intuitive parameters to be set if known, but it should not be necessary to set arbitrary values when no prior information is available. Compo allows many intuitive parameters to be set, but all of these parameters may also be estimated automatically. As Compo uses p-values as a universal significance measure, motifs discovered using different parameter settings can be directly compared. This allows Compo to be run with multiple settings and then to automatically select the most significant motifs across these settings. By default Compo tries a large range of values for the number of components in modules, the size of the distance window, the allowed number of component motif misses, and the hit density factor used to determine hit-thresholds in the initial discretization. Furthermore, Compo has no so-called nuisance parameters, that is, parameters that reflect properties of the algorithm rather than properties of the module to be discovered.
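The selection across parameter settings can be pictured as a simple loop over a grid of candidate values, pooling all discovered motifs and ranking them by p-value. The sketch below is only a schematic illustration: `best_across_settings`, `toy_discover`, the parameter names and the toy p-values are all hypothetical and do not correspond to Compo's actual interface or results.

```python
from itertools import product


def best_across_settings(discover, grid, top_n=3):
    """Run `discover` for every combination of parameter values in `grid`
    and return the `top_n` (p_value, motif, settings) triples with the
    lowest p-values across all settings."""
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        settings = dict(zip(names, values))
        for motif, p_value in discover(**settings):
            results.append((p_value, motif, settings))
    # p-values are comparable across settings, so one global sort suffices.
    results.sort(key=lambda r: r[0])
    return results[:top_n]


def toy_discover(max_components, window):
    """Hypothetical stand-in for one motif search; returns made-up p-values."""
    p_value = 0.05 / (max_components * (window / 50))
    return [(f"motif_c{max_components}_w{window}", p_value)]


grid = {"max_components": [2, 3], "window": [50, 200]}
best = best_across_settings(toy_discover, grid, top_n=2)
```

The point of the sketch is that no per-setting normalization is needed: because every motif carries a p-value, results from different settings can be pooled and ranked directly.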

The composite motif discovery method that is most similar to Compo is probably ModuleSearcher [7]. However, although Compo and ModuleSearcher are similar in search algorithm, there are also important differences. Compo uses real background DNA in its score computations, and may instead of a ranked list also return a multi-objective solution as output. If a standard ranked list is chosen as output in Compo, the composite motifs are ranked by p-values, which also allows composite motifs to be compared across parameter settings. Instead of relying on a fixed or specified value for each parameter, Compo can thus take a list of candidate parameter values as input and select the highest scoring motifs across parameter settings automatically. Furthermore, Compo explicitly models fault-tolerant absence of motif instances in composite motifs. Finally, Compo is able to use several different approaches to pre-processing in the search procedure.

Conclusion

The results on the benchmark suite show a very competitive quantitative performance for Compo using default parameters, in particular in cases where support across sequences may be utilized. In addition to this, Compo has some qualitatively advantageous properties. The intuitive parameters and discovery algorithm make the method relatively transparent, and the results are more easily interpretable compared to many other methods. The option of considering composite motif discovery as a multi-objective optimization problem allows users to spot higher-order trends in results and to postpone making trade-offs between objectives until after the search. Finally, with a general discovery algorithm and a relatively accessible Python source code, Compo lends itself to experimentation and further development.

Methods

The main benchmark data set consists of all composite modules in the TransCompel database [25] that have at least five annotated instances. Details are given in Klepper et al. [24]. The prediction performance of Compo was compared against the methods Cister [12], Cluster-Buster [18], Stubb [31], ModuleSearcher [7], MSCAN [13], CMA [32], CisModule [9] and MCast [17]. The performance of each method was tested using PWMs compiled based on these binding sites (custom matrices version of benchmark). The robustness of predictions was tested by adding non-relevant (noise) motif matrices to the input data, as described in the original benchmark study [24]. Compo was also tested on data sets of liver-specific [27] and muscle-specific [26] gene sets taken from the literature, as well as a recent benchmark based on the REDfly database [28]. Visualizations of annotated binding sites in the muscle and liver data sets are given as Additional files 2 and 3, respectively.

Data on the other methods are taken from the original benchmark study [24], and are in general generated with default parameter settings. Since choosing the proper parameter values can sometimes prove crucial for performance, it was decided to provide the programs with a few general clues where applicable. The size of modules was specified as not exceeding 200 bp (300 bp in the muscle dataset). The modules were defined as consisting of exactly two single binding sites for different TFs in the TransCompel dataset, and possibly up to ten binding sites for four and five different TFs on the liver and muscle sets respectively. Furthermore, binding sites could potentially overlap, and the composition of the modules in liver and muscle sets was allowed to vary between sequences. As ModuleSearcher does not match the PWMs against the sequences itself, a program called MotifScanner was used as pre-processor for ModuleSearcher. Both of these programs were developed by the same group and are part of the Toucan suite of tools for regulatory sequence analysis [33].

The performance on benchmark data is given as the nucleotide-level correlation coefficient (nCC) from the comparison between predicted and known modules, as previously described e.g. in the benchmark studies of Tompa et al. [3] and Klepper et al. [24]. Here nTP, nFP, nTN and nFN represent true positive, false positive, true negative and false negative predictions at the nucleotide level.

$$nCC = \frac{nTP \cdot nTN - nFN \cdot nFP}{\sqrt{(nTP + nFN)(nTN + nFP)(nTP + nFP)(nTN + nFN)}}$$
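For reference, the nCC measure can be computed from the four nucleotide-level counts as in the sketch below. The function name `ncc` is chosen for illustration, and returning 0 when the denominator is zero is an assumption made here, since the formula itself is undefined in that case.

```python
from math import sqrt


def ncc(ntp, nfp, ntn, nfn):
    """Nucleotide-level correlation coefficient from the counts of true
    positives, false positives, true negatives and false negatives."""
    denominator = sqrt((ntp + nfn) * (ntn + nfp) * (ntp + nfp) * (ntn + nfn))
    if denominator == 0:
        # Assumption: treat the undefined case (e.g. no predictions) as 0.
        return 0.0
    return (ntp * ntn - nfn * nfp) / denominator


# Toy counts (made up) for a 1000 bp sequence with a 50 bp annotated module.
perfect = ncc(ntp=50, nfp=0, ntn=950, nfn=0)    # prediction equals annotation
shifted = ncc(ntp=0, nfp=50, ntn=900, nfn=50)   # prediction misses the module
empty = ncc(ntp=0, nfp=0, ntn=950, nfn=50)      # nothing predicted
```

A perfect prediction gives an nCC of 1, while a prediction that misses the annotated module entirely gives a slightly negative value.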

In the benchmark suite compiled from the REDfly database, we followed the evaluation procedure defined in the article proposing the benchmark [28]. A collection of 53 PWMs accompanying the benchmark was used as single motif input for each data set. Here, Compo was compared against Stubb [31], MCD [28, 34], D2z and CSAM [28].

Availability and requirements

Compo is written in Python, and is freely available as source code under the GPL license at http://tare.medisin.ntnu.no/compo/index.php.

References

  1. Bailey TL, Elkan CE: Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning 1995, 21: 51–80.
  2. Hughes JD, Estep PW, Tavazoie S, Church GM: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol 2000, 296(5):1205–14.
  3. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Regnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 2005, 23: 137–44.
  4. Kato M, Hata N, Banerjee N, Futcher B, Zhang MQ: Identifying combinatorial regulation of transcription factors and binding motifs. Genome Biol 2004, 5(8):R56.
  5. Wray GA, Hahn MW, Abouheif E, Balhoff JP, Pizer M, Rockman MV, Romano LA: The evolution of transcriptional regulation in eukaryotes. Mol Biol Evol 2003, 20(9):1377–419.
  6. Marsan L, Sagot MF: Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. J Comput Biol 2000, 7(3–4):345–62.
  7. Aerts S, Van Loo P, Thijs G, Moreau Y, De Moor B: Computational detection of cis-regulatory modules. Bioinformatics 2003, 19(Suppl 2):II5–II14.
  8. Sandve GK, Drabløs F: A survey of motif discovery methods in an integrated framework. Biol Direct 2006, 1: 11.
  9. Zhou Q, Wong WH: CisModule: de novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proc Natl Acad Sci USA 2004, 101(33):12114–9.
  10. Xing EP, Wu W, Jordan MI, Karp RM: Logos: a modular bayesian model for de novo motif detection. J Bioinform Comput Biol 2004, 2: 127–54.
  11. Gupta M, Liu JS: De novo cis-regulatory module elicitation for eukaryotic genomes. Proc Natl Acad Sci USA 2005, 102(20):7079–84.
  12. Frith MC, Hansen U, Weng Z: Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics 2001, 17(10):878–89.
  13. Johansson O, Alkema W, Wasserman WW, Lagergren J: Identification of functional clusters of transcription factor binding motifs in genome sequences: the MSCAN algorithm. Bioinformatics 2003, 19(Suppl 1):i169–76.
  14. Wagner A: Genes regulated cooperatively by one or more transcription factors and their identification in whole eukaryotic genomes. Bioinformatics 1999, 15(10):776–84.
  15. Sharan R, Ovcharenko I, Ben-Hur A, Karp RM: CREME: a framework for identifying cis-regulatory modules in human-mouse conserved segments. Bioinformatics 2003, 19(Suppl 1):i283–91.
  16. GuhaThakurta D, Stormo GD: Identifying target sites for cooperatively binding factors. Bioinformatics 2001, 17(7):608–21.
  17. Bailey TL, Noble WS: Searching for statistically significant regulatory modules. Bioinformatics 2003, 19(Suppl 2):II16–II25.
  18. Frith MC, Li MC, Weng Z: Cluster-Buster: Finding dense clusters of motifs in DNA sequences. Nucleic Acids Res 2003, 31(13):3666–8.
  19. Agrawal R, Srikant R: Mining sequential patterns. Eleventh International Conference on Data Engineering (ICDE'95) 1995, 3–14.
  20. Boulicaut JF, Jeudy B: Constraint-Based Data Mining. In The Data Mining and Knowledge Discovery Handbook. Springer; 2005.
  21. Sandve GK, Drabløs F: Generalized Composite Motif Discovery. In 7th Int Conf on Knowledge-Based Intelligent Information and Engineering Systems, KES. Volume 3683. LNCS/LNAI, Springer-Verlag; 2005:763–769.
  22. Gordon DB, Nekludova L, McCallum S, Fraenkel E: TAMO: a flexible, object-oriented framework for analyzing transcriptional regulation using DNA-sequence motifs. Bioinformatics 2005, 21(14):3164–5.
  23. Bailey TL, Elkan C: The value of prior knowledge in discovering motifs with MEME. Proc Int Conf Intell Syst Mol Biol 1995, 3: 21–9.
  24. Klepper K, Sandve GK, Abul O, Johansen J, Drabløs F: Assessment of composite motif discovery methods. BMC Bioinformatics 2008, 9: 123.
  25. Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, Voss N, Stegmaier P, Lewicki-Potapov B, Saxel H, Kel AE, Wingender E: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res 2006, 34(Database issue):D108–10.
  26. Wasserman WW, Fickett JW: Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol 1998, 278: 167–81.
  27. Krivan W, Wasserman WW: A predictive model for regulatory sequences directing liver-specific transcription. Genome Res 2001, 11(9):1559–66.
  28. Ivan A, Halfon M, Sinha S: Computational discovery of cis-regulatory modules in Drosophila without prior knowledge of motifs. Genome Biol 2008, 9: R22.
  29. Gallo SM, Li L, Hu Z, Halfon MS: REDfly: a Regulatory Element Database for Drosophila. Bioinformatics 2006, 22(3):381–383.
  30. Bryne JC, Valen E, Tang MH, Marstrand T, Winther O, da Piedade I, Krogh A, Lenhard B, Sandelin A: JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res 2007, 36: D102–6.
  31. Sinha S, van Nimwegen E, Siggia ED: A probabilistic method to detect regulatory modules. Bioinformatics 2003, 19(Suppl 1):i292–301.
  32. Kel A, Konovalova T, Waleev T, Cheremushkin E, Kel-Margoulis O, Wingender E: Composite Module Analyst: a fitness-based tool for identification of transcription factor binding site combinations. Bioinformatics 2006, 22(10):1190–7.
  33. Aerts S, Van Loo P, Thijs G, Mayer H, de Martin R, Moreau Y, De Moor B: TOUCAN 2: the all-inclusive open source workbench for regulatory sequence analysis. Nucleic Acids Res 2005, 33(Web Server issue):W393–6.
  34. Grad YH, Roth FP, Halfon MS, Church GM: Prediction of similarly acting cis-regulatory modules by subsequence profiling and comparative genomics in Drosophila melanogaster and D. pseudoobscura. Bioinformatics 2004, 20(16):2738–2750.

Acknowledgements

We want to thank Kai Trengereid, Tarjei Hveem, Vetle Valebjørg and Øystein Lekang for valuable contributions made as part of their Master's projects, and Jostein Johansen and Kjetil Klepper for help with benchmarking Compo. Also thanks to Kjetil Klepper for preparing visualizations of the muscle and liver data sets. FD has been supported by The National Programme for Research in Functional Genomics in Norway (FUGE) in The Research Council of Norway and by The Svanhild and Arne Must Fund for Medical Research.

Author information

Corresponding author

Correspondence to Geir Kjetil Sandve.

Additional information

Authors' contributions

GKS conceived the initial idea, devised the algorithms, implemented the method and drafted the main parts of the manuscript. OA contributed to the scientific content of the paper, formalized the machine learning perspective, drafted the section on problem definition and took part in writing on all parts of the manuscript. FD supervised and took part in all stages of the project.

Electronic supplementary material

12859_2008_2512_MOESM1_ESM.pdf

Additional file 1: Additional formulas and experimentation details. Further details on pruning of search space and computation of hit-probability, as well as details on the experimental setup. (PDF 81 KB)

12859_2008_2512_MOESM2_ESM.pdf

Additional file 2: Binding sites in muscle data set. A visualization of annotated binding sites in the muscle data set [26]. (PDF 35 KB)

12859_2008_2512_MOESM3_ESM.pdf

Additional file 3: Binding sites in liver data set. A visualization of annotated binding sites in the liver data set [27]. (PDF 22 KB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Sandve, G.K., Abul, O. & Drabløs, F. Compo: composite motif discovery using discrete models. BMC Bioinformatics 9, 527 (2008). https://doi.org/10.1186/1471-2105-9-527

  • DOI: https://doi.org/10.1186/1471-2105-9-527

Keywords