Abstract
Background
It is becoming increasingly important for researchers to be able to scan through large genomic regions for transcription factor binding sites or clusters of binding sites forming cisregulatory modules. Correspondingly, there has been a push to develop algorithms for the rapid detection and assessment of cisregulatory modules. While various algorithms for this purpose have been introduced, most are not well suited for rapid, genome scale scanning.
Results
We introduce methods designed for the detection and statistical evaluation of cisregulatory modules, modeled as either clusters of individual binding sites or as combinations of sites with constrained organization. In order to determine the statistical significance of module sites, we first need a method to determine the statistical significance of single transcription factor binding site matches. We introduce a straightforward method of estimating the statistical significance of single site matches using a database of known promoters to produce data structures that can be used to estimate pvalues for binding site matches. We next introduce a technique to calculate the statistical significance of the arrangement of binding sites within a module using a maxgap model. If the module scanned for has defined organizational parameters, the probability of the module is corrected to account for organizational constraints. The statistical significance of single site matches and the architecture of sites within the module can be combined to provide an overall estimation of statistical significance of cisregulatory module sites.
Conclusion
The methods introduced in this paper allow for the detection and statistical evaluation of single transcription factor binding sites and cisregulatory modules. The features described are implemented in the Search Tool for Occurrences of Regulatory Motifs (STORM) and MODSTORM software.
Background
The identification of transcription factor binding sites and cisregulatory modules is a crucial step in the study of gene regulation. Transcription factor binding sites often occur together in clusters as cisregulatory modules and these modules can regulate the transcription of genes 100 kilobases (kb) or further from the module sites [1,2]. To study the regulation of transcription in silico, it has become necessary to scan large intergenic regions for occurrences of these modules. When scanning through large intergenic regions, it is important to be able to evaluate the statistical significance of predicted module sites in order to distinguish likely modules for experimental validation. The methods we introduce in this paper allow for the rapid detection of cisregulatory module sites as well as an estimation of their statistical significance. These tools will be of great use to any researcher wishing to scan large intergenic regions for potential cisregulatory modules.
The pvalue corresponding to a given match score for a transcription factor binding site is the probability of observing a match score at least as great for a randomly generated site. We start by describing a method for the accurate estimation of statistical significance of single binding sites. This method utilizes an extensive database of known human, mouse and rat transcription start sites, the CSHL mammalian promoter database (CSHLmpd) [3] to tabulate frequencies of occurrences for all kmers in known sequences. This information is then used to calculate pvalues that correspond to particular thresholds for matrices as well as to calculate the threshold of occurrence corresponding to a given pvalue. If a number of binding sites have been predicted and a position frequency matrix (PFM) that corresponds to these binding sites is known, this technique can be used to obtain a pvalue for the predicted sites. Conversely, if the desired pvalue is known for a given matrix, the threshold corresponding to this pvalue can be calculated and this can be used to find the sites in sequences. One major advantage of using an extensive database of known sequence regions to estimate pvalues is that the result will not depend on the sample data and therefore results obtained from different motiffinding or scanning algorithms can be compared using pvalues. Although previous work has addressed the problem of estimating the statistical significance of binding site matches, these methods were forced to either assume uniform nucleotide frequencies for all sequences tested or to look at distributions of scores across random sequences to be computationally feasible. The novel methods we present here utilize known DNA sequence in an efficient manner to avoid any prior assumptions on nucleotide distribution in regulatory DNA sequences. This allows us to make more accurate estimates on statistical significance of binding site predictions.
After describing the estimation of pvalues for single transcription factor binding sites, we show how the single site pvalues can be combined with a clustering probability to determine the overall module probability. We consider two separate models of cisregulatory modules, modules with constrained organization and modules without constrained organization. Modules without organizational constraints are simply a cluster of individual binding sites with the only parameters of the model being the pvalues of the individual sites and the clustering pvalue. The clustering probability in this model is calculated using a maxgap model. We also consider modules that have prescribed organizational constraints. The organizational parameters we consider are the order, spacing and orientation of motifs in the module. For modules with organizational constraints, the module probability is calculated using a maxgap model, and then adjusted for the organizational parameters.
Results
Statistical significance of single sites
Previous work on statistical significance of single sites
Many approaches to obtain pvalues for motif site matches using position weight matrices (PWMs) have been introduced [412]. We now briefly review the original works in this field, and touch upon some of the more recent work that has extended these original methods.
The work of Staden is the original work on calculating probabilities for finding motif matches in DNA sequence [4]. The method outlined by Staden uses probability generating functions to calculate the theoretical probability of finding motif matches. Much of the existing work on calculating probabilities of motif matches is built directly on Staden's work. Claverie showed that given the base composition of random sequence, one can compute the expected distribution of random scores for each individual position and, iteratively, the expected distribution of the total score [5]. This distribution of scores is close to a Gaussian, as in Staden's work, and the expected best matching score in each sequence follows an extreme value distribution. The software program NMksite allows for the calculation of statistical significance given a PFM and a threshold [6]. Beckstette et al. [11], introduced the ESAsearch algorithm which assesses the statistical significance of binding site matches using a method based on Staden's work. Baily and Gribskov introduced a method to estimate the statistical significance of motif matches based on Fisher's "omnibus" procedure for combining onesided statistical tests [7]. Hertz and Stormo introduced two methods to approximate the significance of motif matches, one based on largedeviation theory and one purely numerical [8]. Nagarajan et al. extended the numerical method of Hertz and Stormo using a cyclic shifted Fast Fourier Transform [12].
Obtaining match pvalues from database frequencies
Assuming the sequences being scanned are a sample from a known population of sequences, we can query that population directly to obtain the pvalue associated with any match score. For example, suppose we are searching a set S of 10 kb sequences immediately upstream of the transcription start site (TSS) of a set of human genes. The population D would be the entire set of 10 kb upstream regions in the human genome. For a scoring matrix M, and a match score t, the probability that a randomly selected site from S scores at least t when matched with M is
where S and D are treated as the sets of sites they contain, and N_{D }is the number of possible sites in D. The ideal way to evaluate Equation 1 is to scan all sequences in D and count the sites with scores at least as great as t. In most applications, however, D is prohibitively large and this ideal solution is impractical.
We describe here a technique that preprocesses a sequence database to create a summary data structure encoding sufficient information to accurately estimate the probability described in Equation 1. We refer to these summary data structures as (g, k)tables, because they tabulate all gspaced kmers (defined below) in the sequence database. Given a matrix and a cutoff score, a (g, k)table allows rapid estimation of the pvalue for that score. Similarly, the (g, k)table allows a cutoff score associated with a given pvalue to be computed. This process is illustrated in Figure 1.
Figure 1. Converting between pvalues and match scores using (g, k)tables. Based on a sequence database, a (g, k)table encodes sufficient information to calculate match score pvalues, or produce a score cutoff corresponding to a given pvalue. Details of this process are described in the text.
Preprocessing sequence databases as (g, k)tables
A gspaced kmer is a sequence segment that consists of k/2 contiguous nucleotides, a space of width g following position k/2, and the next k/2 contiguous nucleotides starting at position k/2 + g + 1 (enforcing k to be even). As an example, the 5spaced 6mer ACGTCA occurs in the following sequence beginning at the fourth position:
TAG
 ACG
 TCA
Although it is possible to construct (g, k)tables from other formations of words, this format of a gspaced kmer can adequately represent the majority of binding sites present in the TRANSFAC [13] and Jaspar [14] databases. A sequence database D may be either a single long sequence or multiple sequences. Both cases are treated similarly. The (g, k)table for database D contains the number of times each (g'spaced kmer occurs in a sequence of D, for 0 ≤ g' ≤ g. For a given word w of width k and containing g spaces, the (g, k)table documents the number of times w occurs in D with at most g central spaces. For larger values of k and g, a (g, k)table contains more information about D, providing more accurate pvalue estimates with more of the positions in a matrix considered exactly. However, the total size of the (g, k)table will be larger.
Calculating expected frequencies of long words
To calculate the expected frequency of words longer than the words tabulated from the database, we first look for the most informative positions in the matrix, with the information content of positions in the matrix defined as in [15]. If the tabulated frequencies are of size k, we find the most informative positions in the matrix that are in the format of a gspaced kmer: k/2gapk/2. All kmers that can be generated from these positions are enumerated using a branch and bound algorithm. The frequency of each kmer above threshold is then looked up in the (g, k)table. To estimate the expected frequencies of the longer words we use the nucleotide content of the database to create a Markov model of order k/2. With this, the consecutive positions of the matrix that are not part of the k most informative positions contribute to the expected frequency in a way that is proportional to their frequency in the promoter database.
Converting between pvalues and scores
Given a scoring matrix M, and a match score t, let W (M, t) be the set of all words w such that score(M, w) ≥ t. Define N_{est }(w, D) as the estimated number of times the word w occurs in sequences of D. We estimate the probability from Equation 1 as
We now describe how, for a given word w, we obtain an accurate value for N_{est }(w, D) using a (g, k)table. We first identify the k positions in M with the greatest information content, under the restriction that those positions must match the format of a gspaced kmer. Let w' be the gspaced kmer in w corresponding to the aforementioned most informative positions in M. Using the (g, k)table, we obtain the exact number of times w' occurs in D. This number of occurrences is then multiplied by the probability of observing the nucleotides at remaining positions in w given the nucleotides at positions in w', as described above.
Implementing this technique, for a given matrix, requires enumerating all words with a match score greater than the specified cutoff. Such enumeration is feasible for the majority of binding site matrices. If a pvalue is given instead of a threshold, the process is reversed: the pvalue is converted into an occurrence frequency, and the (g, k)table is used to estimate a score threshold for which we will observe that occurrence frequency in the database. When scanning sequences, each site constitutes a distinct trial, and therefore, the pvalues must be corrected for multiple testing. The simple Bonferroni correction [16], which assumes independence between trials, may be used in this case. Any dependencies introduced by autocorrelation are accounted for by the counts tabulated in the (g, k)tables: autocorrelating words will have greater counts in the database, and this is directly reflected in the (g, k)table.
Accuracy of (g, k)table calculations
To assess our ability to determine pvalues of binding site predictions using (g, k)tables, we carried out the following tests. Tables for two different size promoters: small (500 to +100 w.r.t TSS) and large (5000 to +500 w.r.t. TSS), and two sizes of kmers: 6mers and 8mers, were constructed for all human promoters in CSHLmpd. The subset of TRANSFAC (v9.3, [13]) matrices described used for testing was all vertebrate matrices with integer counts.
Each matrix in the subset was scanned on both of the human promoter sets and thresholds for site matches were set at a functional depth of 0.9. A functional depth (described in the Methods section) of 0.9 was used here because all matrices have a reasonable match threshold corresponding to this functional depth. The empirical pvalue corresponding to this functional depth was then calculated as the number of site matches divided by the total number of windows of the size of the matrix contained in the particular promoter set. The matrix threshold corresponding to a functional depth of 0.9 was then converted to a pvalue using the appropriate (g, k)table and an error was calculated as the difference between the empirical pvalue calculated as above and the pvalue calculated using the (g, k)table. The average error for all matrices tested for both sizes of promoters and kmers is shown in Table 1.
Table 1. Error in the estimation of pvalues using (g, k)tables.
We also wanted to measure the error created with the approach to pvalue calculation as implemented in the ESAsearch program. To test this software, we first calculated an empirical pvalue for each matrix in the manner described above. This pvalue was then used as an input parameter for ESAsearch to scan for sites with at least this probability. The resulting number of sites found by ESAsearch was used to calculate the empirical pvalue resulting for the program and the error in the process was calculated as the difference in the two pvalues. The average errors produced by the ESAsearch algorithm are listed in Table 1.
The error in the calculations using the (g, k)tables decreases as we use larger promoters and larger kmers to construct the (g, k)tables, which is expected. As more information is used in the calculation, the errors involved will be reduced. More suggestions on the proper selection of sequences to construct a (g, k)table are included in the Discussion section. We also tested the NMksite software [6] in its ability to carry out this procedure. For the two cases it was tested on (both small and large promoters), out of the 524 matrices that were tested, NMksite could not return thresholds for pvalues small enough for 148 and 132 of the matrices respectively. This is due to the fact that NMksite produces a distribution of thresholds and pvalues and if the desired pvalue is smaller than any of the pvalues listed by NMksite, the user is unable to get a corresponding threshold. If we ignored the matrices NMksite could not determine thresholds for, the errors for the small and large promoter sets were 1.30e2 and 1.35e2 respectively, which are much larger than the errors introduced by either the ESAsearch or (g, k)table method.
Statistical significance of module sites
As mentioned previously, it is often the case that binding sites occur together in clusters as cisregulatory modules and these modules can occur in regions very far from the genes they regulate. In this section, we extend the method described above for the estimation of statistical significance of single binding sites to the statistical significance of modules.
To assess the statistical significance of gene clusters, Hoberman et al. [17,18] developed statistics for gene clusters based on a maxgap parameter, which is defined as the maximum distance allowed between any two genes in a cluster. In the following sections, we outline how the maxgap technique can be extended to assessing the statistical significance of clusters of transcription factor binding sites. Once we have estimated the statistical significance of the clustering of motif sites, we can combine pvalues for individual sites with the clustering pvalue to get an overall significance estimation for cisregulatory modules.
Previous work on scanning for cisregulatory modules
The algorithms that have been introduced to find cisregulatory modules can be divided into four types of algorithms: (1) algorithms for de novo module discovery, (2) algorithms that attempt to cluster binding sites found with traditional methods into modules, (3) algorithms that use phylogenetic information and (4) algorithms that scan for a predefined module profile. Algorithms for de novo discovery of modules include ModuleSearcher [19], CoBind [20], Bioprospector [21], MITRA [22], dyad search [23], CisModule [24], EMCModule [25] and the Gibbs Module Sampler [26]. Algorithms that combine single sites together to form modules include the logistic regression analysis method introduced by Wasserman and Fickett [27], MSCAN [28,29], COMET [30,31], MCAST [32], Composite Module Analysis [33], Ahab [34], ModuleFinder [35] and methods based on estimating the clustering significance of single sites under a Poisson distribution model of motif sites used by Lifanov et al. [36] and Wagner [3739]. In addition to these methods, a number of recent algorithms have used conservation to try to predict cisregulatory modules [4044]. Algorithms designed specifically for scanning include CisModScan [45], Target Explorer [46] and ModuleScanner [19]. We will discuss these algorithms further when comparing their performance to our method.
The original work by Wasserman and Fickett [27] took into account the overall organization of a module in that their technique has the ability to learn the order of the motifs in a module and the method used by Klingenhoff et al. considered the distances between single sites in a module [47]. A few of the more recent algorithms for de novo module discovery have also begun to take into account the organization of modules. For example, the algorithms introduced by Gupta and Liu [25] and Thompson et al. [26] take into account the distances between motif sites in a module. These methods, however, are not built for scanning through sequences for defined modules but rather for de novo discovery of modules. None of the existing algorithms that allow for identifying and estimating the statistical significance of module sites allow the user to search for modules with defined organization.
Statistical significance of modules without constrained organization
In this section we deal with the calculation of statistical significance of motif clusters with no consideration given to overall module organization. Hoberman et al. [17,18] considered the clustering of a chain of genes where genes were treated as points in a chain. Therefore, they did not have to deal with the lengths of genes or the lengths of gaps between genes in a cluster. For our situation, we are considering motifs in a sequence window and the lengths of motifs are very important. Our treatment of the maxgap statistics diverges from that of Hoberman et al. due to this. An example of a maxgap cluster of m binding sites of width w in a sequence of length l with a maxgap g can be seen in Figure 2.
Figure 2. Maxgap cluster – module without organizational constraints. A maxgap cluster of motifs as a module without organizational constraints. The length of the sequence (l) is 16 bases, the number of motifs (m) is four, the widths of each motif (w_{i}) is two bases and the maxgap (g) is three bases.
Counting arrangements
The number of ways of creating a maxgap cluster of m motifs, a maximum gap of g and total length l is equivalent to rolling to m  1 dice, with faces numbered 0 to g, such that the sum of their faces is equal to l  ∑_{i }w_{i}, where 0 <i ≤ m. This is, in turn, equivalent to rolling to m  1 dice with faces numbered 1 to g + 1 with the sum of their faces equal to l  ∑_{i }(w_{i } 1)  1. Using this analogy, Hoberman et al. [17,18] derived an expression for the number of ways to construct a maxgap cluster of m genes anywhere in a sequence of length l. When considering motifs with widths, the expression
counts the number of ways to form a maxgap cluster of m factors with widths w_{i}, in a sequence of length l and a maxgap parameter g. The differences in this expression and the expression derived by Hoberman et al. are first, in Hoberman et al., the sum of all the gaps was equal to l  1 while here the sum of all the gaps is equal to l  ∑_{i }(w_{i } 1)  1 and second, we multiply the entire expression by m! because a group of factors in a unique order qualifies as a unique module.
Probabilities for the occurrence of clusters
With the ability to enumerate all possible ways to place m factors with widths w_{i }in a sequence of length l such that the maximum possible gap allowed between any two factors is g we can calculate a pvalue for the maxgap configuration. Under the null hypothesis that each site is uniformly distributed in the sequence, the pvalue for the maxgap configuration is
The numerator in this equation is given by Equation 2 with the m! factor canceled by the denominator. The denominator in this equation is the number of ways to place m nonoverlapping motifs in a sequence of length l (see 1), with the m! factor canceled by the numerator.
Additional file 1. Appendix. The number of ways to place m motifs with widths in a sequence of length l.
Format: PDF Size: 56KB Download file
This file can be viewed with: Adobe Acrobat Reader
Statistical significance of modules with constrained organization
Here we extend the treatment of modules without organizational constraints to deal with the organizational parameters of the ordering of motifs, the spacing between the motifs and the orientation of the motifs in a module. Any or all of these organizational parameters can be specified when searching for a given module. In this section we outline methods to calculate statistical significance of module sites with any or all of these organizational parameters specified. In each case, the overall probability of the cluster with no organization is first calculated, and then this probability is corrected for the organizational constraints. An example of a module with specified organizational parameters can be seen in Figure 3. We note that a smaller module with organization can exist within a larger module without organization.
Figure 3. Maxgap cluster – module with organizational constraints. Maxgap cluster of motifs as a module with organizational constraints. There are four motifs in this module that must occur in the order A,B,C,D. The spacings between the motifs are defined as s_{x }and the orientations of the motifs are labeled with arrows above the motifs.
Modules with order specified
We again define m as the number of motifs, g as the maxgap parameter and l as the total length of the window. The total number of permutations of m factors is simply m!. Therefore, if the order of the factors in a module is specified, along with a maxgap parameter, the probability of the module is:
Pr(gap ≤ g{m, w, l}, r) = Pr(gap ≤ g, m, w, l) P (r),
where Pr(gap ≤ gm, w, l) is given by Equation 3 and,
Modules with spacing specified
The correction for the probability of a module site when the spacing between any two pairs of factors is known can be calculated as
Pr(gap ≤ g{m, w, l}, r, s) = Pr(gap ≤ g, m, w, l) P (r) P (s),
where Pr(gap ≤ gm, w, l) is given by Equation 3, P (r) is given by Equation 4 and
where m_{s }is the number of spacings which are defined in the organization and span_{i }is the size of the allowed span between the two factors flanking the i^{th }spacing (e.g. span_{i }= 25 if the spacing between motif A and motif B is known to be between 10 and 35 bases). In this situation we have also corrected for the order of the motifs in the module being specified. In general, when the spacing between two factors is defined, the order of the factors will be defined as well.
Modules with strands specified
If the strands of the motifs in the module are specified, then the final probability of the module is simply multiplied by 1/2 for every motif that has a specified strand. Therefore, if the organization of a module has specified strands, the final probability for the module is
Pr(gap ≤ g{m, w, l}, d) = Pr(gap ≤ gm, w, l) P (d)
where Pr(gap ≤ gm, w, l) is given by Equation 3 and,
where m_{d }is the number of motifs for which strand information is known.
Combining significance values from individual sites and module structure
With pvalues calculated for individual sites as well as for the clustering property of the module, we can combine these significance values to get an overall module significance. There are a number of ways that this can be done depending on the situation. If it is desirable to have the individual sites contribute more to the overall module score, the pvalues for the individual sites can be weighted accordingly. Conversely, if the clustering property is more important, the probability of clustering can be weighted more. The most straightforward way to combine the pvalues is just to take the product of the individual pvalues as the overall probability value and apply a Bonferroni correction. The technique we choose to combine the pvalues in MODSTORM is the technique used by Bailey and Gribskov [48].
Scanning for the IFNβ enhancer
The IFNβ enhancer is a classic example of an module with organizational constraints, occurring in the promoter of the IFNβ gene [49,50]. To examine the ability of MODSTORM to find modules with organization in the human genome, we scanned all intergenic regions in the human genome with a module representing the IFNβ enhancer (for details see 2). Scanning for this module using MODSTORM and a module without any constrained organizational parameters finds nine occurrences of this module genomewide. By enforcing the organization, only one occurrence is found and it is immediately upstream of the IFNβ gene. The location of this predicted module is shown in Figure 4. The pvalue for this module site is 6.74e12. This module site is also found when searching for the module without defined organizational parameters. However, the module pvalue in this case is only 2.13e9.
Additional file 2. IFNβ Enhancer. Details on the module used to scan for the IFNβ enhancer.
Format: PDF Size: 65KB Download file
This file can be viewed with: Adobe Acrobat Reader
Figure 4. IFNβ enhancer. A screen shot of the IFNβ enhancer found with MODSTORM as a track on the UCSC genome browser (Human Mar. 2006 (hg18) assembly [55]). The top track spans the entire length of the module and is labeled with the module significance. The individual motifs occurrences are shown below. The predicted location of this site is consistent with the experimentally verified site.
To compare MODSTORM with other existing module scanning tools, we scanned for the IFNβ module in a sequence set consisting of all large promoters (5000 to +500 w.r.t. TSS). The MODSTORM software, with or without organization, scans through all of these sequences and finds the one true occurrence of this module in roughly two minutes and 30 seconds on a modern LINUX workstation. CisModScan takes 30 minutes on the same machine to repeat the same task and is not able to find the true module site. Sosinsky et al. tested Target Explorer on the task of finding the IFNβ module. Target Explorer was used to scan for windows of 50 bp within which at least one binding site for cJun, ATF2, IRF, and NFκB occurred and it was able to find a window where all occurred in the human interferonbeta promoter [46]. Target Explorer does not attempt to assign any statistical significance to module occurrences. Aerts et al. also applied the ModuleScanner algorithm to the problem of finding the IFNβ enhancer. They constructed a module consisting of motifs for NFκB, IRF1 and HMGIY. ModuleScanner found many occurrences of this module, with the fourth most significant hit being upstream of the human interferonbeta gene. In general, algorithms that just scan for module occurrences will be much faster than any of the algorithms that attempt to cluster together single sites from a library of known matrices. To illustrate this, we tested the COMET software for its ability to find the IFNβ enhancer. COMET, given the four PFMs that make up the IFNβ module as a library, scans for two hours, again on the same machine. While it does find some clusters containing three or four matrices, COMET does not find the true site.
Implementation of STORM and MODSTORM
The method of using a sequence database to convert between score cutoffs and pvalues was implemented in STORM, the Search Tool for Occurrences of Regulatory Motifs. STORM searches sequences for sites that match a scoring matrix, and is ideally suited for searching for multiple motifs simultaneously because the sequences are preprocessed as suffix trees [51]. Once the suffix trees are constructed, scoring matrices are matched along the suffix trees using a branch and bound strategy, making the time to search with each matrix proportional to the width of the matrix, instead of the length of the sequence.
The MODSTORM software preprocesses sequences into suffix trees in the same manner as STORM. When searching for modules without organizational constraints, the motifs making up the module are ranked by information content. The highest information content motif is scanned for first. If occurrences of this motif are found, the remaining motifs are searched for in a window surrounding the first located motif. The motifs are searched for in order of information content. If at any step, the motif being searched for is not found in the window, the search for the remaining factors is abandoned and the scanning for the highest information content motif is resumed.
When searching for modules with organizational constraints, the 5' most motif is scanned for first. If an occurrence of this is detected, a window size defined by the organization of the module is scanned for the next motif in the module and so on. If any one of the motifs is not discovered, the scanning for the 5' most motif is resumed.
Discussion
Importance of choosing the correct database to build (g, k)tables
For pvalues to be informative they must be calculated relative to the appropriate sequence database. When searching for sites in 1 kb proximal promoters from the rat genome, the most appropriate database to use is the complete set of 1 kb rat promoters. When scanning through human intergenic regions that have been masked using RepeatMasker [52], the most appropriate background sequence set to use in constructing a (g, k)table is the complete set of masked human intergenic regions. One common problem in transcription factor binding site detection is the presence of CpG islands. To address this issue, tables can be constructed for CpG and nonCpG sequence and the appropriate table can be used when scanning in regions where the CpG status is known. In addition to CpG islands, there are numerous features of genomic sequence (such as repetitive regions) that could affect the statistics calculated using (g, k)tables. The selection of a proper background set to create (g, k)tables when analyzing such regions is imperative to the interpretation of the results. In addition to the software which creates (g, k)tables, we provide tables calculated for human, mouse and rat promoters from the CSHL mammalian promoter database for three sizes: small (500 to +100 w.r.t TSS), medium (1000 to +300 w.r.t. TSS) and large (5000 to +500 w.r.t. TSS) as well as tables constructed from all intergenic regions in human, mouse and rat. It is important to stress the difference between statistical and biological significance of kmers in genomic sequence. While this method will find kmers that are statistically significant, wet lab experiments must be performed to determine biological significance.
Finding optimal modules with organizational parameters
When scanning for modules with organizational parameters, the current method will not necessarily find the optimal scoring module. A scenario exists where weaker motifs existing in advantageous organization would score higher than stronger motifs in less advantageous organization in the same genomic window. Our algorithm is based on finding motifs first and then assessing organization and would thus return the stronger motifs with the weaker organization as the module hit from this region.
Modifications of our current algorithm which performs the module search in a more iterative fashion would be better able to deal with the situation of weaker motifs having strong organization. For example, after finding module sites, the threshold scores of the motifs comprising the module could be lowered and the module region could be searched for each motif again. If lower scoring motifs are found, the module could be optimized by its overall significance by considering motif scores and organization.
Extension of MODSTORM to module discovery
The MODSTORM software could easily be extended to library based module discovery algorithm. In this case, a library of known motifs would be scanned across every window and motif combinations that lead to significant modules would be retained. This is an interesting problem and there are multiple existing software tools that attempt to solve this problem, several of which have been mentioned above. We choose to focus here on the efficient scanning for defined modules as this is an area of research that is lacking.
Conclusion
We have introduced novel methods to aid in the detection and estimation of statistical significance for both single sites and modules of transcription factor binding sites. Our method of using real promoters to estimate statistical parameters provides an intuitive way to interpret the threshold for binding site matches. The techniques introduced to calculate the statistical significance of clusters of transcription factor binding sites allow for the statistical evaluation of putative cisregulatory modules. The search algorithms and statistical significance calculations implemented in the software tools we provide allow for the scanning of large intergenic sequences for single transcription factor binding sites as well as cisregulatory modules. These software tools have the potential to greatly aid in the in silico study of transcriptional regulation.
Methods
Other scoring methods allowed in STORM and MODSTORM
In addition to scoring putative motif site matches with pvalues as described above, we have also incorporated a number of other scoring schemes into the STORM and MODSTORM software. The additional scoring methods include standard PWM threshold scoring, the percentage of maximum score, and the functional depth.
A common and intuitive approach for cutoff selection takes the maximum possible score derived from a given matrix, and sets the threshold for occurrence at some percentage of the maximum score. This idea was shown to have merit by Tronche et al., [53] who showed that HNF1 sites with a score greater than 83% of the maximum score of the scoring matrix showed experimental evidence of binding. Although this score of 83% of the maximum score works well for HNF1, it is expected that different factors will have different percentages of maximum scores allowed for binding.
The term "functional depth" was first introduced by Beer and Tavazoie [54] as a term to represent the threshold above which functional binding will occur for a particular factor represented by a PWM. We define the functional depth for a PWM score S as,
where S_{min }and S_{max }are the minimum and maximum scores possible for the PWM.
Availability and requirements
Project name: STORM
Project home page: http://rulai.cshl.edu/storm webcite
Operating system(s): Linux/Unix
Programming language: C++
Other requirements: The STORM and MODSTORM programs are available as part of the Comprehensive Regulatory Element Analysis and Discovery (CREAD) package, see http://rulai.cshl.edu/cread/ webcite for more information.
License: GNU GPL
Any restrictions to use by nonacademics: None
Authors' contributions
DS conducted the experiments; DS and AS contributed to writing of software; all authors contributed to design of the methodology and writing of the manuscript.
Acknowledgements
We thank Xiaoyue Zhao, Chaolin Zhang, Zhenyu Xuan and Sitharthan Kamalakaran for numerous discussions and critical reading of the manuscript. We also thank an anonymous reviewer for helpful comments. This work was supported by National Institutes of Health Grant HG001696 and National Science Foundation Grant EIA0324292.
References

Webber A, Ingram R, Levorse J, Tilghman S: Location of enhancers is essential for the imprinting of H19 and Igf2 genes.
Nature 1998, 391:711715. PubMed Abstract  Publisher Full Text

Leighton P, Saam J, Ingram R, Stewart C, Tilghman S: An enhancer deletion affects both H19 and Igf2 expression.
Genes Dev 1995, 9:20792089. PubMed Abstract

Xuan Z, Zhao F, Wang J, Chen G, Zhang MQ: GenomeWide Promoter Extraction and Analysis in Human, Mouse and Rat.

Staden R: Methods for calculating the probabilities of finding patterns in sequences.
Computer Applications in the Biosciences 1989, 5(2):8996. PubMed Abstract

Claverie JM: Some Useful Statistical Properties of PositionWeight Matrices.
Computers Chem 1994, 18(3):287294. Publisher Full Text

Claverie JM, Audic S: The statistical significance of nucleotide positionweight matrix matches.
Computer Applications in the Biosciences 1996, 12(5):431439. PubMed Abstract

Baily TL, Gribskov M: Methods and statistics for combining motif match scores.
Journal of Computational Biology 1998, 5:211221. PubMed Abstract

Hertz G, Stormo G: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences.
Bioinformatics 1999, 15(7):563577. PubMed Abstract  Publisher Full Text

Stormo GD: DNA binding sites: representation and discovery.
Bioinformatics 2000, 16:1623. PubMed Abstract  Publisher Full Text

Rahmann S, Muller T, Vingron M: On the Power of Profiles for Transcription Factor Binding Site Detection.
Statistical Applications in Genetics and Molecular Biology 2003, 2.

Beckstette M, Stothmann D, Homann R, Giegerich R, Kurtz S: PoSSuMsearch: Fast and Sensitive Matching of Position Specific Scoring Matrices using Enhanced Suffix Arrays.
Proceedings of the German Conference in Bioinformatics 2004, 5364.

Nagarajan N, Jones N, Keich U: Computing the Pvalue of the information content from an alignment of multiple sequences.
Bioinformatics 2005, 21:i311i318. PubMed Abstract  Publisher Full Text

Matys V, KelMargoulis OV, Fricke E, Liebich I, Land S, BarreDirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, Voss N, Stegmaier P, LewickiPotapov B, Saxel H, Kel AE, E W: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes.
Nucleic Acids Research 2006, 34:D108110. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sandelin A, Alkema W, Engström P, Wasserman WW, Lenhard B: JASPAR: an open access database for eukaryotic transcription factor binding profiles.
Nucleic Acids Research 2004, 32:D9194. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Schneider TD, Stormo GD, Gold L, Ehrenfeucht A: Information content of binding sites on nucleotide sequences.
Journal of Molecular Biology 1986, 188(3):415431. PubMed Abstract  Publisher Full Text

Sokal RR, Rohlf FJ: Biometry. W.H. Freeman and Company; 2001.

Hoberman R, Sankoff D, Durand D: The Statistical Significance of MaxGap Clusters. In Proceedings of the 2nd RECOMB Workshop on Comparative Genomics. Volume 3388. SpringerVerlag; 2005::5571.

Hoberman R, Sankoff D, Durand D: The Statistical Analysis of Spatially Clustered Geens under the Maximum Gap Criterion.
Journal of Computational Biology 2005, 12(8):10831102. PubMed Abstract  Publisher Full Text

Aerts S, Loo PV, Thijs G, Moreau Y, Moor BD: Computational detection of cisregulatory modules.
Bioinformatics 2003, 19(Suppl. 2):ii5ii14. PubMed Abstract  Publisher Full Text

GuhaThakurta D, Stormo GD: Identifying target sites for cooperatively binding factors.
Bioinformatics 2001, 17(7):608621. PubMed Abstract  Publisher Full Text

Liu JS, Liu X, Brutlag DL: BIOPROSPECTOR: Discovering Conserved DNA motifs in upstream regulatory regions of coexpressed genes.
Proceedings of the Pacific Symposium on Biocomputing 2001, 6:127138.

Eskin E, Pevzner PA: Finding composite regulatory patterns in DNA sequences.
Bioinformatics 2002, 18:S354S363. PubMed Abstract  Publisher Full Text

van Helden J, Andre B, ColladoVides J: Discovering regulatory elements in noncoding sequences by analysis of spaced dyads.
Nucleic Acids Research 2000, 28(8):18081818. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhou Q, Wong WH: CisModule: de novo discovery of cisregulatory modules by hierarchical mixture modeling.
Proc Natl Acad Sci USA 2004, 101(33):1211412119. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gupta M, Liu J: De novo cisregulatory module elicitation for eukaryotic genomes.
Proceedings of the National Academy of Sciences of the USA 2005, 102(20):70797084. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Thompson W, Palumbo MJ, Wasserman WW, Liu JS, Lawrence CE: Decoding Human Regulatory Circuits.
Genome Research 2004, 19671974. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wasserman WW, Fickett JW: Identification of Regulatory Regions which Confer MuscleSpecific Gene Expression.
J Mol Biol 1998, 278:167181. PubMed Abstract  Publisher Full Text

Johansson O, Alkema WB, Wasserman WW, Lagergren J: Identification of functional clusters of transcription factor binding motifs in genome sequences: the MSCAN algorithm.
Bioinformatics 2003, 19(Suppl. 1):i169il76. PubMed Abstract  Publisher Full Text

Alkema WB, Johansson O, Lagergren J, Wasserman WW: MSCAN: identification of functional clusters of transcription factor binding sites.

Frith MC, Hansen U, Weng Z: Detection of ciselement clusters in higher eukaryotic DNA.
Bioinformatics 2001, 17(10):878889. PubMed Abstract  Publisher Full Text

Frith MC, Spouge JL, Hansen U, Weng Z: Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences.
Nucleic Acids Research 2002, 30(14):32143224. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bailey TL, Noble WS: Searching for statistically significant regulatory modules.
Bioinformatics 2003., 19(Suppl. 2) PubMed Abstract  Publisher Full Text

Kel A, Konovalova T, Waleev T, Cheremushkin E, KelMargoulis O, Wingender E: Composite Module Analyst: a fitnessbased tools for identification of transcription factor binding site combinations.
Bioinformatics 2006, 22(10):11901197. PubMed Abstract  Publisher Full Text

Rajewsky N, Vergassola M, Gaul U, Siggia ED: Computational detection of genomic cisregulatory modules applied to body patterning in the early Drosophila embryo.
BMC Bioinformatics 2002., 3(30) PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Philippakis AA, He FS, Bulyk ML: MODULEFINDER: A Tool for Computational Discovery of Cis Regulatory Modules.
Proceedings of the Pacific Symposium on Biocomputing 2005, 519530.

Lifanov AP, Makeev VJ, Nazina AG, Papatsenko DA: Homotypic regulatory clusters in Drosophila.
Genome Res 2003, 13(4):579588. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wagner A: A computational genomics approach to the identification of gene networks.
Nucleic Acids Res 1997, 25(18):35943604. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wagner A: A computational "genome walk" technique to identify regulatory interactions in gene networks.
Pac Symp Biocomput 1998, 264278. PubMed Abstract

Wagner A: Genes regulated cooperatively by one or more transcription factors and their identification in whole eukaryotic genomes.
Bioinformatics 1999, 15(10):776784. PubMed Abstract  Publisher Full Text

Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB: Exploiting transcription factor binding site clustering to identify cisregulatory modules involved in pattern formation in the Drosophila genome.
Proceedings of the National Academy of Sciences of the USA 2002, 99(2):757762. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Berman BP, Pfeiffer BD, Laverty TR, Salzberg SL, Rubin GM, Eisen MB, Celniker SE: Computational identification of developmental enhancers: conservation and function of transcription factor bindingsite clusters in Drosophila melanogaster and Drosophila pseudoobscura.
Genome Biol 2004, 5(9):R61. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Sinha S, Schroeder MD, Unnerstall U, Gaul U, Siggia ED: Crossspecies comparison significantly improves genomewide prediction of cisregulatory modules in Drosophila.
BMC Bioinformatics 2004, 5:129. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Grad YH, Roth FP, Halfon MS, Church GM: Prediction of similarily acting cisregulatory modules by subsequence profiling and comparative genomics in Drosophila melanogaster and D. pseudoobscura.
Bioinformatics 2004, 20(16):27382750. PubMed Abstract  Publisher Full Text

Blanchette M, Bataille AR, Chen X, Poitras C, Laganiere J, Lefebvre C, Deblois G, Giguere V, Ferretti V, Bergeron D, Coulombe B, Robert F: Genomewide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression.
Genome Research 2006, 16:656668. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Johnson DS, Zhou Q, Yagi K, Satoh N, Wong W, Sidow A: De novo discovery of a tissuespecific gene regulatory module in a chordate.
Genome Research 2005, 15:13151324. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sosinsky A, Bonin CP, Mann RS, Honig B: Target Explorer: an automated tool for the identification of new target genes for a specified set of transcription factors.
Nucleic Acids Research 2003, 31(13):35893592. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Klingenhoff A, Freeh K, Quandt K, Werner T: Functional promoter modules can be detected by formal models independent of overall nucleotide sequence similarity.
Bioinformatics 1999, 15(3):180186. PubMed Abstract  Publisher Full Text

Bailey TL, Gribskov M: Combining evidence using pvalues: Application to sequence homology searches.
Bioinformatics 1998, 14:4854. PubMed Abstract  Publisher Full Text

Munshi N, Yie J, Senger K, Lomvardas S, Agalioti T, Thanos D: The IFNβ Enhancer: A Paradigm for Understanding Activation and Repression of Inducible Gene Expression.
Cold Spring Harbor Symposia on Quantitative Biology 1999, LXIV:149159. Publisher Full Text

Struhl K: A Paradigm for Precision.
Science 2001, 293:10541055. PubMed Abstract  Publisher Full Text

McCreight EM: A SpaceEconomical Suffix Tree Construction Algorithm.
Journal of the Association for Computing Machinery 1976, 23(2):262272.

Smit A, Hubley R, Green P: RepeatMasker Open3.0. 1996–2004. [http://www.repeatmasker.org] webcite

Tronche F, Ringeisen F, Blumenfeld M, Yaniv M, Pontogilo M: Analysis of the Distribution of Binding Sites for a Tissuespecific Transcription Factor in the Vertebrate Genome.
Journal of Molecular Biology 1997, 266:234245. Publisher Full Text

Beer MA, Tavazoie S: Predicting Gene Expression from Sequence.
Cell 2004, 117:185198. PubMed Abstract  Publisher Full Text

Kent W, Sugnet CW, Furey TS, Roskin K, Pringle TH, Zahler AM, Haussler D: The Human Genome Browser at UCSC. [http://genome.ucsc.edu] webcite
Genome Research 2002, 12(6):9961006. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Beckstette M, Homann R, Glegerich R, Kurtz S: Fast index based algorithms and software for matching position specific scoring matrices.
BMC Bioinformatics 2006., 7(389) PubMed Abstract  Publisher Full Text  PubMed Central Full Text