A fast weak motif-finding algorithm based on community detection in graphs

Jia, Caiyan; Carson, Matthew B; Yu, Jian

doi:10.1186/1471-2105-14-227

Methodology article
Open access
Published: 17 July 2013

A fast weak motif-finding algorithm based on community detection in graphs

Caiyan Jia^1,2,
Matthew B Carson² &
Jian Yu¹

BMC Bioinformatics volume 14, Article number: 227 (2013) Cite this article

6493 Accesses
9 Citations
2 Altmetric
Metrics details

Abstract

Background

Identification of transcription factor binding sites (also called ‘motif discovery’) in DNA sequences is a basic step in understanding genetic regulation. Although many successful programs have been developed, the problem is far from being solved on account of diversity in gene expression/regulation and the low specificity of binding sites. State-of-the-art algorithms have their own constraints (e.g., high time or space complexity for finding long motifs, low precision in identification of weak motifs, or the OOPS constraint: one occurrence of the motif instance per sequence) which limit their scope of application.

Results

In this paper, we present a novel and fast algorithm we call TFBSGroup. It is based on community detection from a graph and is used to discover long and weak (l,d) motifs under the ZOMOPS constraint (zero, one or multiple occurrence(s) of the motif instance(s) per sequence), where l is the length of a motif and d is the maximum number of mutations between a motif instance and the motif itself. Firstly, TFBSGroup transforms the (l, d) motif search in sequences to focus on the discovery of dense subgraphs within a graph. It identifies these subgraphs using a fast community detection method for obtaining coarse-grained candidate motifs. Next, it greedily refines these candidate motifs towards the true motif within their own communities. Empirical studies on synthetic (l, d) samples have shown that TFBSGroup is very efficient (e.g., it can find true (18, 6), (24, 8) motifs within 30 seconds). More importantly, the algorithm has succeeded in rapidly identifying motifs in a large data set of prokaryotic promoters generated from the Escherichia coli database RegulonDB. The algorithm has also accurately identified motifs in ChIP-seq data sets for 12 mouse transcription factors involved in ES cell pluripotency and self-renewal.

Conclusions

Our novel heuristic algorithm, TFBSGroup, is able to quickly identify nearly exact matches for long and weak (l, d) motifs in DNA sequences under the ZOMOPS constraint. It is also capable of finding motifs in real applications. The source code for TFBSGroup can be obtained from http://bioinformatics.bioengr.uic.edu/TFBSGroup/.

Background

Transcription factors play an irreplaceable role in the activation and repression of gene expression by binding to specific sites within promoter regions of target genes. Identification of transcription factor binding sites (TFBSs) is a basic task for elucidating the molecular mechanisms of transcriptional regulation. Although traditional footprinting assays can accurately identify the precise binding sites of any factor, this low-throughput method is highly technical and can analyze only a single small region (< 1 kb) at a time. With the development of high-throughput sequencing technologies, a number of experimental techniques such as ChIP-chip and ChIP-seq have been used to identify the location of transcription factor binding sites. However, these methods are unable to resolve DNA-protein interactions at base pair resolution [1]. In silico identification of over-represented DNA motifs from the promoters of co-regulated or homologous genes as well as ChIP-enriched regions plays a significant role in locating binding sites in a high-throughput and high-resolution manner.

Since a DNA motif is usually highly conserved or over-represented among DNA sequences, there are two main approaches to its representation: (1) represent a motif by its profile or position-specific scoring matrix (PSSM) [n_j,k]_l×4, which records the frequency of base k (k∈{A,C,G,T}) at position j (j={1,2,⋯,l}) for all aligned sites [2-4] or (2) characterize a motif as an l-length consensus string describing a motif with the most frequent nucleotide in each position of all aligned sites. According to these two TFBS models, existing motif-finding algorithms can be divided into two classes. The first includes algorithms that maximize a statistical or entropy-related score of a PSSM [n_j,k]_l×4. CONSENSUS [5], MEME [6], Gibbs Sampler [7], AlignACE [8], PROJECTION [9], and CRMD [10] belong to this group. These algorithms use optimization techniques from the fields of statistics and machine learning, including the greedy strategy [5], the Expectation-Maximization method [6], Gibbs sampling methods [7-9], and the clustering method [10]. These algorithms usually have a fast run time. Sometimes, however, they cannot converge to the global optimum, especially for short motifs with high levels of statistical noise or long motifs with a large search space. The second class of algorithms usually searches for (l, d) motifs based on the consensus model [11] by employing heuristic methods, but in some cases use optimal techniques. It is supposed that each sequence contains zero, one, or multiple motif instance(s) with up to d mutations within a true motif [12]. A large number of algorithms have been proposed to exactly or almost exactly extract (l, d) motifs from N input sequences with length L. SPELLER [13], WEEDER [14, 15], MITRA-count [16], Voting [17], PMSprune [18], WINNOWER [11], iTriplet [19], VINE [20], Stemming [21], RecMotif [22], and sMCL-WMR [23] are included in this group. These algorithms usually have a high time complexity for long motifs. This limits their application, especially toward prokaryotic promoters [24, 25]. In this study, we intend to offer a highly efficient algorithm for finding long and weak (l, d) motifs and to use this algorithm to identify TFBSs in prokaryotic data sets.

Motif-finding algorithms based on the consensus model can be further divided into two categories: pattern-driven and sample-driven approaches [16]. Using a pattern-driven approach, one tries to enumerate all possible 4^ll-mer motifs with lexical order. When applying a sample-driven approach, all possible (l, d) motifs generated from the real l-mers of input sequences are often tested. SPELLER, WEEDER, and MITRA-count are pattern-driven approaches and Voting, PMPprune, WINNOWER, iTriple, VINE, Stemming, RecMotif, and sMCL-WMR are sample-driven. In general, pattern-driven approaches can automatically find (l, d) motifs in samples without being given the length l. However, the state-of-the-art sample-driven approaches are usually faster than the state-of-the-art pattern-driven approaches, and thus can be used to extract motifs with a larger l and d.

Recently, the sample-driven approach, which transforms the (l,d) motif search by extracting the maximum clique or q-cliques (q≤N) from an N-partite graph, has attracted much attention. In this graph, each vertex is an l-mer. There is an edge between two l-mers from two different sequences if the Hamming distance between these two l-mers is no more than 2d. This is because the Hamming distance between each instance of a motif and the motif itself is assumed to be at most d. Thus, all instances of a motif must form a maximum clique or q-cliques in the graph. This idea was first presented in WINNOWER, which utilizes an extendable mechanism to cut off spurious edges by extending k-cliques (k=2 or k=3) to larger (k+1)-cliques. However, the accuracy of WINNOWER cannot be guaranteed since there is strong background noise in many sequences and true edges may be pruned by its local extension mechanism. The VINE algorithm, with its rigorous pruning steps, was proposed to speed up and increase the precision rate of WINNOWER. Similarly, iTriplet was designed to randomly select two reference sequences and identify all triplets (3-cliques) in these as well as each of the remaining N−2 sequences. All discovered triplets along with their sequence information are then inserted into hash tables as candidate motifs. If a table has enough instances (e.g., at least q), the motif can be identified as ‘true’. RecMotif was created to extract N-cliques by using reference sequences as well. It takes the selected reference vertices from the first x (x={1,2,⋯,N}) reference sequences in order to select new reference vertices in the remaining sequences. As x is increased, the selection is continued (x←x+1) if new reference vertices can be selected from the remaining sequences to obtain x-cliques. Otherwise, the algorithm backtracks to the first x−1 reference vertices and finds a substitute. RecMotif has been shown to be very fast for some (l, d) cases (e.g., (15, 4), (18, 5) and (21, 6)). However, in tests performed by Sun et al. [22], it failed to find some weak motifs such as (15, 5), (18, 6), and (19, 7). Additionally, RecMotif operates under the OOPS constraint. During real applications, some sequences may not contain any instance of a true motif while some may contain multiple instances. With this work, we offer a more efficient algorithm for extracting long and weak (l, d) motifs from N-partite graphs using the more biologically-relevant ZOMOPS constraint.

During our research, we made the following observations: (1) there may be too many spurious edges in an N-partite graph to extract the q-cliques (q≤N) needed to identify a weak motif (e.g., (15, 4) or (18, 6)) and (2) if we construct the N-partite graph such that there is an edge between two l-mers from two different sequences if the Hamming distance between these two l-mers is no more than x (d≤x≤2d) instead of exactly 2d, the motif instances in the graph may form a dense subgraph instead of a clique. Based on this information, we present a new algorithm: TFBSGroup. It first extracts dense subgraphs, which are groups of candidate instances (i.e., TFBSs), by the fast community detection algorithm BGLL [26]. It then greedily refines these candidate motifs towards the true motif within their own communities. Our empirical study shows that TFBSGroup can quickly discover long and weak (l, d) motifs (e.g., (18, 6) and (24, 8) motifs within 30 seconds) in synthetical samples under the ZOMOPS constraint. More importantly, it is able to rapidly identify motifs in a large data set of prokaryotic promoters [25] generated from the Escherichia coli database RegulonDB [27]. It is also able to accurately discover motifs in 12 mouse transcription factor ChIP-seq data sets involved in ES cell pluripotency and self-renewal [28].

Results

We first tested TFBSGroup on a series of synthetic (l, d) samples and compared it with iTriplet (source code: http://www.rci.rutgers.edu/∼gundersn/iTriplet/) and RecMotif (source code provided by the authors). iTriplet and RecMotif are both sample-driven algorithms which heuristically extract q-cliques from an N-partite graph (q=N for RecMotif because of the OOPS constraint). Meanwhile, we compared TFBSGroup with the pattern-driven algorithms SPELLER, WEEDER, and MITRA in order to reveal more differences between sample-driven and pattern-driven approaches. We then used TFBSGroup on a large data set of prokaryotic promoters generated from the Escherichia coli database RegulonDB for the purpose of finding real long and weak motifs. Also, we showed the results of TFBSGroup in discovering motifs in ChIP-seq data sets for 12 mouse transcription factors involved in ES cell pluripotency and self-renewal [28]. All experiments were performed on a computer with an Intel 2.99 GHz processor and 2GB of main memory running the Windows XP.

Benchmark data sets

Like the previous work of Buhler and Tompa [9], the testing samples were generated synthetically using the following steps:

1)
A parent motif of length l was chosen by randomly picking l bases from the nucleotide set {A,C,G,T}.
2)
N i.i.d. background sequences of length L were constructed at random.
3)
q (q≤N) sequences were randomly selected from these N background sequences.
4)
The following steps were performed for each of the selected q background sequences:
1. 4.1)
  An instance of the parent motif was created by randomly choosing d (d<l) positions of the motif and randomly mutating these d bases to one of the four nucleotides.
2. 4.2)
  A consecutive substring of random length l was selected from each background sequence.
3. 4.3)
  The substring in Step 4.2 was replaced with a newly generated instance of the motif.

In our experiments, unless otherwise specified, the number N and the length L of sequences in an (l, d) sample are set to 20 and 600, respectively.

Comparisons between TFBSGroup and state-of-art algorithms using (l, d) samples

Firstly, to show the efficiency of TFBSGroup, we compared it with state-of-the-art algorithms including the pattern-driven SPELLER, WEEDER, and MITRA-count (MITRA for short) and the sample-driven iTriplet and RecMotif on the same testing samples (Table 1). Secondly, we tested the effect of the maximal length L of DNA sequences [20]. The test results are shown in Table 2. WEEDER(q) indicates the execution time of WEEDER given q, q/f indicates that WEEDER failed to find the true motif for the given value q, TFBSGroup(x) indicates the run time of TFBSGroup given x (d≤x≤2d), s, m, and h denote seconds, minutes, and hours respectively, and ^′−^′ indicates a run time of over 10h.

Table 1 Comparisons of TFBSGroup with state-of-art algorithms on ( l , d ) samples

Full size table

Table 2 The influence of sequence length l on (15, 4)

Full size table

To demonstrate the accuracy of our algorithm, we ran TFBSgroup on 100 randomly generated test samples for each (l, d) pair and reported the number of samples in which the implanted motifs were correctly reported in the top 1/5/10, and which were correctly identified but listed below the top 10 (denoted b10). We also reported the number of samples in which the implanted motifs were not correctly reported by TFBSGroup (denoted f), since our algorithm TFBSGroup may fail in some cases. The accuracy of TFBSGroup for different (l, d) pairs is shown as 1/5/10/b10/f after TFBSGroup(x) in Tables 1 and 2. For example, we correctly found motifs ranked first in 95 samples, motifs ranked within the top 5 in 98 samples, and motifs ranked within the top 10 in 99 samples for (15, 4). However, we failed on one sample set. Thus, the accuracy of TFBSGroup on 100 (15, 4) samples can be estimated as 95/98/99/0/1, where motifs were ranked by their significance score [14, 15]. Furthermore, since the run times of TFBSGroup on different samples of the same (l, d) pair showed negligible difference (usually <1 second) under the same parameter settings, we did not average the run times of 100 samples for an (l, d) pair but instead kept the run times of TFBSGroup on the same sample sets used in the efficiency comparisons with other algorithms.

In addition, Table 1 and Table 2 describe the results of VINE (Huang et al. [20]) and sMCL-WMR (Boucher and King [23], sMCL for short) in order to compare these sample-driven algorithms, which extract cliques from N-partite graphs, to our work. ^′∗^′ indicates that no result was available from literature. The experiments using VINE were performed on a PC with an Intel Pentium IV 3.40 GHz processor and 1GB of main memory running Windows. Those for sMCL-WMR were performed on a Linux machine with a 2.6 GHz processor and 1Gb of RAM running Ubuntu Linux. We also tested iTriplet and found that the run times of two implementations of this algorithm on the same (l, d) sample were greatly different due to its random mechanism for selecting two reference sequences and an l-mer within a reference sequence. Taking five runs of iTriplet on the same (15, 4) sample as an example, the minimum execution time was 0.859 seconds and the maximum was 9.156 seconds. We have reported the average run time of 5 runs of iTriplet.

As shown in Table 1, the sample-driven algorithms run faster than the pattern-driven variety. However, except for MITRA, we used only q and d as input in all implementations of the pattern-driven algorithms. The length l of planted motifs can be predicted by these algorithms while all the sample-driven types tested above and MITRA require l, d and q to be specified in advance. TFBSGroup can find all long and weak (l, d) motifs within 30 seconds under the ZOMOPS constraint. Most planted motifs in the synthetic samples (with the exception of (10, 2)) were correctly reported in the top 1/5/10. TFBSGroup performed with high accuracy when identifying exact matches for long and weak (l, d) motifs such as (15, 4), (16, 5), (18, 6), and (19, 7). However, TFBSGroup failed to find exact planted motifs in many cases involving conserved motifs (e.g., (10, 2)). This may be because the networks are too sparse to form good communities. For hard (l, d) motif search problems such as (15, 5), (16, 5), (17, 6), and (18, 6), TFBSGroup is much more efficient than iTriplet, RecMotif, VINE, and sMCL-WMR. In addition, it is not easy to tune the parameter q in WEEDER since, with the decrease of q, the run time is dramatically increased. Moreover, according to our experiments, iTriplet cannot be guaranteed to find true inserted motifs in all cases because of its random mechanism. Also, the memory usage of iTriplet is much higher and can freeze the computer during searches for long and weak motifs such as (19, 7).

Table 1 and Table 2 show that the sample-driven algorithms are more sensitive to the length L of DNA sequences, which influences the scale and the noise ratio of an N-partite graph given x. The pattern-driven approach is much more sensitive to l and d, which dominate the scale of the search space. These results are consistent with the time complexity of these algorithms collected from [21, 22] and shown in Table 3. According to Table 1 and Table 2, the value of q has a strong influence on the efficiency of WEEDER. For instance, when L=700, the true motif could not be found when we set q=19. We then let q=18 and ran WEEDER again. The run time when q=18 was two to three times longer than when q=19. We also observed that a shorter length L of sequences led to a more accurate TFBSGroup result. This is consistent with the theoretical results in Zia and Moses [29].

Table 3 Algorithmic complexity

Full size table

Mining for transcription factor binding sites in Escherichia coliK-12

In order to further evaluate TFBSGroup, we used the algorithm on a large data set (ECRDB70-10) to find the binding sites within promoters of Escherichia coli K-12 DNA. This data, collected by Hu et al. [25], is stored in RegulonDB [27] and contains groups of sequences with experimentally-determined binding sites in the middle regions of the sequences. We selected sequence sets with more than five DNA sequences. We used published motif consensuses in the literature, especially the results in Li et al. [30], as a guide for inferring the values of l and d. We listed the motif consensuses, which were similar to the consensuses published in the literature. If no exact match or similar result was found in the literature, we listed the top ranked motif consensuses with the most binding locations in the middle regions of the sequences. The test results (obtained within 1 minute by running TFBSGroup) are shown in Figures 1 and 2, where ‘TF’ indicates the name of the transcription factor, ‘ #’ indicates the number of sequences in the corresponding set, ‘Consensus’ indicates the motif consensus of the corresponding TFBSs given a TF, ‘Logo’ indicates the sequence logo of all TFBSs for a specified TF (created using the web-based application tool Weblogo [31]), ‘Rank’ indicates the ranking number of the significance score [14, 15] for the motif consensuses, ‘Lit.’ indicates that similar motif consensuses have been published in the literature while ‘*’ means no similar motif consensus was found in the literature, and (l,d) and x are represented in the same way as in Table 1.

As shown Figures 1 and 2, TFBSGroup can efficiently find over-represented long motifs from prokaryotic promoters. We illustrate this point using the well-studied TFs CRP, FNR, and LexA as examples [21]. Binding site data for the CRP protein includes 138 DNA sequences of length 219 with the consensus TGTGAnnnnnnTCACA (consensus model: (18, 7)) and 138 actual binding sites. The FNR data includes 50 DNA sequences of length 222 with the consensus TTGATnnnnATCAA (consensus model: (14, 4)) and 50 actual binding sites. The LexA data includes 10 DNA sequences of length 222 with the consensus CTGTnnnnnnnnnnCAG (consensus model: (16, 6)) and 10 actual binding sites. For all three sets, the expected motifs are ranked first by TFBSGroup in terms of their significance score: CRP is reported to have 121 sites (63 true), FNR is predicted to have 46 sites (23 true), and LexA is reported to have 12 sites (8 true). The precision on the site level ( $precision = \frac{TP}{TP + FP}$ ) is close to or greater than 50% on these three data sets, where TP is the number of true positive sites and FP is the number of false positive sites. It should be pointed out, however, that some results marked with an ‘*’ in Figures 1 and 2 may not be satisfactory due to the low specificity of binding sites for the TFs, insufficient number of sequences from which to a draw statistical conclusion, or a lack of knowledge of the proper (l, d) models. Compounding the problem is the fact that the true consensuses in these data sets are unknown, a difficulty which exists for all consensus model-based algorithms.

Motif discovery in 12 mouse ES CELL ChIP-seq data sets

To further evaluate the accuracy of motifs predicted by TFBSGroup, we analyzed the ChIP-seq data sets for 12 DNA-binding TFs (CTCF, cMyc, Esrrb, Klf4, Nanog, nMyc, Oct4, Smad1, Sox2, STAT3, Tcfcp2I1, and Zfx) involved in mouse embryonic stem cell pluripotency and self-renewal [28]. To prepare the data sets for use with motif discovery algorithms, we first extracted peak regions from ChIP-seq data using MACS [50] with a FDR (false discovery rate) threshold of 0.2. We then mapped the centers of the ChIP-seq peaks to the mouse mm10 assembly and extracted 100-bp of genomic sequence centered around each peak. To compare motifs identified by TFBSGroup with motifs found in Chen et al. [28], we ran TFBSGroup on hundreds of peaks with low p-values. The results of Chen et al. and TFBSGroup are shown in Figure 3, where all sequence logos predicted by TFBSGroup, including those in Figure 4, were also created using the web-based application tool Weblogo [31]. We found motifs matching those identified in Chen et al. [28]. Specifically, motif logos predicted by Chen et al. [28] and TFBSGroup for each TF in Figure 3 (with the exception of Klf4 and Zfx) are exactly or ‘almost’ exactly the same. However, it is well known that Klf4 is able to recognize GC-rich regions. ZFX has no known published consensus sequence, but its predicted motif agrees to some extent with the result of Chen et al. and the result predicted by cEMRMIT [51].

In [28], Chen et al. used WEEDER and then refined and extended the motifs with an Expectation-Maximization method. This second step was necessary because the supplied version of the WEEDER algorithm limited the motif search to a maximum of 12 bps. As discussed in the previous sections, WEEDER operated with low efficiency for long motifs and was difficult to tune for the parameter q. On the contrary, TFBSGroup was able to find both long and weak motifs. We obtained the motifs and their TFBS locations in sequences within 1 minute for all data sets with just one run of TFBSGroup.

In addition, we found alternative motifs for some TFs such as OCT4, Esrrb and CTCF, which were also reported in a previous study [52]. One significant alternative motif for each of the three TFs is shown in Figure 4. The TFBS sequences of this alternative motif were complementary to those of the main motif in Figure 3 for each of three TFs.

Conclusions and discussion

In this work, we have developed a novel and efficient algorithm (TFBSGroup) for identifying (l, d) motifs under the ZOMOP constraint. It extracts dense subgraphs from an N-partite graph using a fast community detection algorithm designed for processing large-scale networks (BGLL). Based on these extracted communities, TFBSGroup heuristically refines candidate motifs and their instances towards the true motifs. Experimental tests on synthetical samples have shown that TFBSGroup can very quickly discover long and weak (l, d) motifs and their instances. More importantly, TFBSGroup has achieved good performance in rapidly identifying motifs in a large data set of promoters generated from Escherichia coli and in accurately discovering motifs in ChIP-seq data sets for 12 mouse transcription factors involved in ES cell pluripotency and self-renewal. Still, TFBSgroup may not work well in the extreme case that the number of mutations between each motif instance and the motif itself is exactly d, since the graph will be too dense to be partitioned sufficiently. Fortunately, this case seldom occurs in real applications. In the future, we plan to improve the algorithm by combining structure- and sequence-based methods in order to address this issue. Also, we plan to improve the algorithm’s ability to process large-scale ChIP-Seq data sets.

Methods

(l, d) motif search and dense subgraph extraction

Given a set of sequences S={s₁,s₂,…,s_N} over a symbol set Σ={A,C,G,T} and positive integers l and d (|s_i|≤L, 1≤i≤N, 1≤l≤L and 0≤d<l), an (l,d) motif search finds a string t∈Σ^l such that for at least q (q≤N) sequences {s_i1,s_i2,⋯,s_{i
q}}⊆S there exists a substring t_{i
j} in each s_{i
j} (j=1,2,⋯,q) with d(t,t_{i
j})≤d, where d(t,t_{i
j}) indicates the Hamming distance between the two strings t and t_{i
j}.

Since the Hamming distance between each instance of a motif and the motif itself is at most d, the Hamming distance between two instances is no more than 2d and all instances of the true motif must form a q-clique. Therefore, we can obtain (l, d) motifs by extracting q-cliques from an N-partite graph where each vertex is an l-mer in S and there is an edge between two l-mers l_i and l_j (l_i and l_j are l-mers of s_i and s_j, respectively, i≠j) if the Hamming distance between the two l-mers is no more than 2d.

As far as synthetic samples randomly generated by the method mentioned in the above section are concerned, the probability of two random l-mers with a maximum distance of x is

p_{(l, x)} = \sum_{i = 0}^{x} (\begin{array}{l} i \\ l \end{array}) {\frac{3}{4}}^{i} \cdot {\frac{1}{4}}^{(l - i)} .

(1)

Thus, for a set of N sequences with length L, there are 0.5×N×(L−l+1)×(N−1)×(L−l+1)×p_(l,2d) random edges in the background sequences. For example, there are an estimated 18.2 million random edges in the background sequences for an (18, 6) sample when N=20 and L=600. There are also some spurious edges around the vertices of motif instances, especially for long and weak (l, d) motifs, since the neighbor vertices of a motif instance may have links to the neighbor vertices of other motif instances. Still, only q∗(q−1)/2≤N∗(N−1)/2 edges are true positive links forming an expected q-clique.

Suppose there is an edge between two vertices in an N-partite graph if the Hamming distance of the vertices is no more than x (d≤x≤2d) instead of 2d. In this case, the number of spurious edges may be dramatically reduced. For example, if we set x=7 for a real (18, 6) sample, there are only 82,343 edges in the N-partite graph (there are an estimated 80,335 edges using Eq. 1). However, the instances of a motif in this situation may not form a q-clique but rather a densely connected subgraph. We can obtain an (l, d) motif by detecting dense subgraphs in an N-partite graph where the distance of two vertices is at most x.

Community detection and dense subgraph identification

In recent years, complex network analysis has been highlighted in the research community. It is a powerful tool used to describe the structure of many complex systems in nature and society and has many potential applications. A network is usually represented by a graph G=(V,E), where V is a set of n vertices and E is a set of m edges representing relationships between pairs of vertices.

Community structure is one of the most important topological characteristics in a network. A community structure is a subgraph of a network whose vertices are more highly connected with each other than with vertices outside the subgraph. Therefore, the problem of community detection requires the partition of a network into communities of densely connected nodes such that ∀u,v,u≠v,C_u∩C_v=∅ and ∪_uC_u=V, where C_u (or C_v) is one of the partitioned communities. It should be apparent that community structure is a type of dense subgraph. The current algorithms for identifying communities in complex networks can be used to find dense subgraphs within graphs. Many methods to identify community structures in complex networks have been developed [53, 54]. As mentioned above, however, an N-partite graph of input sequences with a long and weak (l, d) motif may be a large network with millions of edges. Fast community detection methods are required to partition a large-scale graph. In the field of complex network analysis, algorithms including Infomap [55], BGLL [26], LPA [56], and RG [57] are designed to efficiently detect communities in very large networks. In this study, we use the BGLL algorithm [26] to find dense subgraphs in an N-partite graph. This algorithm is best for our purposes since we only need a coarse partition and BGLL is very fast and easy to use. The source code for this software can be obtained from http://findcommunities.googlepages.com/.

BGLL: a near linear time algorithm for community detection

BGLL is a heuristic method for optimizing modularity (Eq. 2), which measures the difference between the empirical distribution of in-community connections of a partition and the expected distribution of in-community connections of a partition in a randomly generated graph with the same vertex degree distribution [58].

Q = \frac{1}{2 m} \sum_{i, j \in V} [A_{ij} - \frac{k_{i} k_{j}}{2 m}] δ (C_{i}, C_{j}),

(2)

where A_{i
j} is the weight of the edge (i,j). If a network is unweighted, A_{i
j}=1 for (i,j)∈E, otherwise A_{i
j}=0. $k_{i} = \sum_{j = 1}^{n} A_{ij}$ is the sum of the weights of the edges attached to vertex i or the degree of the node i for an unweighted network. C_i is the community to which the node i is assigned and δ(·,·) is the Kronecker function. A larger Q yields a better partition.

The BGLL algorithm can be divided into two iterative phases. Firstly, it assigns a different community to each node of a network. Then, for each node i, it considers the neighbors j of i ((i,j)∈E) and evaluates the gain of modularity Δ Q (Eq. 3) that would take place by removing i from its community and placing it in the community of j.

\begin{array}{l} ΔQ = [\frac{\sum_{in} + 2 k_{i, in}}{2 m} - {(\frac{\sum_{tot} + k_{i}}{2 m})}^{2}] \\ - [\frac{\sum_{in}}{2 m} - {(\frac{\sum_{tot}}{2 m})}^{2} - {(\frac{k_{i}}{2 m})}^{2}], \end{array}

(3)

where $\sum_{in}$ is the sum of the weights of the edges inside a community C_u, $\sum_{tot}$ is the sum of the weights of the edges incident to nodes in C_u, and k_{i,i n} is the sum of the weights of the edges from i to nodes in C_u. If the gain is negative, i stays in its original community, otherwise, i is placed in the community which provides maximum gain. The second phase of the algorithm involves building a new network whose nodes make up the communities found during the first phase. The weights of the edges between the new nodes are the sum of the weight of the edges between nodes in the corresponding two communities. Edges between nodes of the same community lead to self-loops for this community in the new network. The algorithm naturally incorporates a notion of hierarchy, which results in communities of communities.

The BGLL algorithm is extremely fast and performs in linear time on typical and sparse data, since the gain of modularity is very easy to compute with Eq. 2 and the number of communities decreases drastically after just a few passes. Most of the run time lies within the first iteration [26]. In this study, we took the results of the first iteration only since we are interested in obtaining coarse-grained candidate motifs and their group of instances from dense subgraphs of the original network and are not concerned about the hierarchical structure of communities.

TFBSGroup: a fast motif-finding algorithm

TFBSgroup operates in two phases. Firstly, we construct an N-partite graph where the distances of pairs of vertices are no more than x (d ≤ x ≤ 2d) for a set of input sequences, which is assumed to contain an (l, d) motif and at least q (q≤N) instances. We then detect all communities with a size of at least t (the default is q/2) in the N-partite graph using the BGLL algorithm to obtain a candidate motif consensus by aligning all l-mers in each community. Secondly, we greedily refine these candidate motifs toward the true motif using the following three steps: 1) For each candidate motif consensus, we look for the vertices in the neighbor set of the current community C_u for which the Hamming distance between the consensus and the corresponding l-mers of these vertices are no more than d in order to form a new community. A vertex belonging to the new community is in the neighbor set of the current community C_u if the position of the corresponding l-mer of the vertex is in the interval $[max {0, po s_{v_{i}} - l}, min {po s_{v_{i}} + l, L - l}]$ . $po s_{v_{i}}$ is the start position of the corresponding l-mer of a vertex v_i (v_i∈C_u) in the sequence to which it belongs. 2) We align the new community to obtain a new candidate motif consensus. We then iteratively execute step 1 and step 2 until each new candidate consensus cannot be changed. 3) We shift the corresponding l-mers (an l-mer corresponds to a vertex v_i in the final community) of the final candidate motif in the interval $[max {0, po s_{v_{i}} - ⌊ l / 3 ⌋}, min {po s_{v_{i}} + ⌊ l / 3 ⌋, L - l}]$ , since the true motif and its instances may be near the final candidate motif consensus and their instances.

Furthermore, since there may be many false positive motifs, we sort all output motifs according to their statistical significance using the method of Pavesi et al. [14, 15] and then delete the duplicates. Finally, the top k significant motifs and their instances are returned. TFBSGroup is so named because it completes its run after all potential motifs and the groups of their instances (corresponding to the groups of TFBSs in a set of DNA sequences) are reported as output.

The details of the TFBSGroup algorithm are shown below, where (i, j) indicates an instance starting at the j-th position of the i-th sequence s_i. t (default = q/2) is used to filter false positive groups forming small communities during the initial stage. This will not affect the result or the speed of the algorithm in our simulations because the largest group usually has the highest significance score. We can set t=0 to ensure all candidate groups are examined. Furthermore, the window size ⌊l/3⌋ is used to ensure that the predicted TFBSs are within its inserted positions and not around them. Generally, we can identify true motifs when the window size is less than l/3. The source code for TFBSGroup can be obtained from http://bioinformatics.bioengr.uic.edu/TFBSGroup/ or Additional file 1.

Algorithm 1: The TFBSGroup Algorithm

Time and space complexity of TFBSGroup

The time complexity of TFBSGroup depends mainly on the first phase of the algorithm, which includes time for constructing an N-partite graph with distance x (d≤x≤2d) and time for extracting communities from the constructed N-partite graph. During the second phase, the algorithm searches through candidate motif consensuses and their instances within each of the communities.

There are at most N×(L−l+1)l-mers for a set of DNA sequences with a length of at most L. Therefore, there are at most N×(L−l+1)×(N−1)×(L−l+1)×l/2=O(N²×L²×l) comparisons for constructing an N-partite graph. During one pass of BGLL, the algorithm computes Δ Q at most t times for each vertex in a network, where t is the maximum number of neighbors of a vertex in the network. The time complexity of BGLL is bound by O(N×L×t×t i m e_{Δ
Q}) for extracting communities from an N-partite graph because there are at most N×(L−l+1) vertices in a network, where t i m e_{Δ
Q} is the time complexity for computing Δ Q. As a result, the time complexity of TFBSGroup for the worst case is

\begin{align} O (t \times N \times L \times tim e_{ΔQ}) & = O (t \times N \times L \times m) \\ ≅ O (p {(l, x)}^{2} \times N^{4} \times L^{4}), \end{align}

where m is the number of edges in a network, m=O(p(l,x) × N² × L²) and t=O(p(l,x) × N × L), as estimated using Eq. 1. However, it should be noted that the above bound can be a substantial overestimate. The time complexity of TFBSGroup is almost equal to the time complexity of BGLL, which is near linear with respect to m in real applications, especially for sparse networks [26].

The space complexity of TFBSGroup is mainly affected by the storage of all l-mers and an N-partite graph, where the distance of two vertices is at most x. Thus, the space complexity of TFBSGroup is O(m)=O(p(l,x)×N²×L²), while it is at least O(p(l,2d)×N²×L²) for previous graph-based algorithms.

The time and space complexity of TFBSGroup and several related algorithms (except for sMCL, since no complexity analysis is available for this algorithm in the literature [23]) are shown in Table 3, where the left half lists pattern-driven algorithms (labeled ‘Pattern-D’) and the right half lists sample-driven algorithms (labeled ‘Sample-D’). $N (l, d) = \sum_{i = 0}^{d} (\begin{array}{l} i \\ l \end{array}) 3^{i}$ and w is the word length, which corresponds to bit length of a processor. The time complexity of RecMotif relies heavily on the value of p(l,2d) (see Table one in Sun et al. [22]). According to Table 3, each algorithm has its own advantages. The time complexity of pattern-driven algorithms is higher but they have lower space complexity. The time complexity of sample-driven algorithms is lower but they generally have higher space complexity. RecMotif is too sensitive to p(l,2d) and L. When p(l,2d) is small, RecMotif runs very fast. However, when p(l,2d) is larger than 0.28, RecMotif can not produce the results within a reasonable amount of time. TFBSGroup is a complement to sample-driven algorithms since it makes a reasonable trade-off between speed and accuracy.

The choice of x

The key problem with TFBSGroup is the selection of the parameter x. If x is too large, the N-partite graph may be too dense to define a community containing only the instances of a motif. If x is too small, the graph is too sparse to form the expected communities and the true group of TFBSs will be missed. In this study, we use an experimental statistical method to estimate x for a specified (l, d) motif search problem. Firstly, for a given l-mer consensus, we created 500 instances of the consensus with the Hamming distance between the consensus and each instance equal to at most d. We then computed the Hamming distance for each pair of instances and counted the number n_y of instance pairs with distance y (y={0,1,2,⋯,2d}) to get the frequency n_y distribution. The center of this distribution should be an estimation of x, i.e., x≥m a x{n_y,y∈{0,1,2,⋯,2d}} or close to this. Taking (18, 6) and (19, 7) as examples, the histograms of the frequency n_y distribution are shown in Figure 5. Using these distributions as a guide, we set x=7 and x=8 for (18, 6) and (19, 7), respectively.

References

Elnitski L, Jin VX, Farnham PJ, Jones SJ: Locating mammalian transcription factor binding sites: a survey of computational and experimental techniques. Genome Res. 2006, 16 (12): 1455-1464. 10.1101/gr.4140006.
Article CAS PubMed Google Scholar
Jensen ST, Liu XS, Zhou Q, Liu JS: Computational discovery of gene regulatory binding motifs: a Bayesian perspective. Stat Sci. 2004, 19: 188-204. 10.1214/088342304000000107.
Article Google Scholar
Pavesi G, Mauri G, Pesole G: In silico representation and discovery of transcription factor binding sites. Brief Bioinformatics. 2004, 5 (3): 217-236. 10.1093/bib/5.3.217.
Article CAS PubMed Google Scholar
Das MK, Dai HK: A survey of DNA motif finding algorithms. BMC Bioinformatics. 2007, 8 (Suppl 7): S21-10.1186/1471-2105-8-S7-S21.
Article PubMed Central PubMed Google Scholar
Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999, 15: 563-577. 10.1093/bioinformatics/15.7.563.
Article CAS PubMed Google Scholar
Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 1994. 1994, Menlo Park, 28-36.
Google Scholar
Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993, 262: 208-214. 10.1126/science.8211139.
Article CAS PubMed Google Scholar
Hughes JD, Estep PW, Tavazoie S, Church GM: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol. 2000, 296 (5): 1205-1214. 10.1006/jmbi.2000.3519.
Article CAS PubMed Google Scholar
Buhler J, Tompa M: Finding motifs using random projections. J Comput Biol. 2002, 9 (2): 225-242. 10.1089/10665270252935430.
Article CAS PubMed Google Scholar
Gang L, Chan TM, Leung KS, Lee KH: Cluster refinement algorithm for motif discovery. IEEE/ACM Trans Comput Biol Bioinformatic. 2010, 7 (4): 654-668.
Article Google Scholar
Pevzner P, Sze S: Combinatorial approaches to finding subtle signals in DNA sequences. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, 2000. 2000, California, USA, 269-278.
Google Scholar
Rajasekaran S, Dinh H: A speedup technique for (l, d)-motif finding algorithms. BMC Res Notes. 2011, 4 (54): 10.1186/1756-0500-4-54.
Sagot MF: Spelling approximate repeated or common motifs using a suffix tree. Proc LATIN’98: Theor Inf, LNCS. 1998, 1380: 111-127.
Google Scholar
Pavesi G, Mauri G, Pesole G: An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics. 2001, 17 (Suppl 1): 207-214. 10.1093/bioinformatics/17.suppl_1.S207.
Article Google Scholar
Pavesi G, Mereghetti P, Mauri G, Pesole G: Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res. 2004, 32 (Suppl 2): W199—W203-
PubMed Central PubMed Google Scholar
Eskin E, Pevzner P: Finding composite regulatory patterns in DNA sequences. Bioinformatics. 2002, 18 (Suppl 1): 354-363. 10.1093/bioinformatics/18.suppl_1.S354.
Article Google Scholar
Chin YL, Leung CM: Voting algorithms for discovering long motifs. Proceedings of the Third Asia-Pacific Bioinformatics Conference, 2005. 2005, Singapore, 261-271.
Chapter Google Scholar
Davila J, Balla S, Rajasekaran S: Fast and practical algorithms for planted (l, d) motif search. IEEE/ACM Trans Comput Biol Bioinformatics. 2007, 4: 544-552.
Article CAS Google Scholar
Ho ES, Jakubowski CD, Gunderson SI: iTriplet, a rule-based nucleic acid sequence motif finder. Algorithms Mol Biol. 2009, 4 (14): 10.1186/1748-7188-4-14.
Huang CW, Lee WS, Hsieh SY: An improved heuristic algorithm for finding motif signals in DNA sequences. IEEE/ACM Trans Comput Biol Bioinformatics. 2010, 8 (4): 959-975.
Article Google Scholar
Kuksa PP, Pavlovic V: Efficient motif finding algorithms for large-alphabet inputs. BMC Bioinformatics. 2010, 1 (Suppl 8): S1-
Article Google Scholar
Sun HQ, Low MYH, Hsu WJ, Rajapakse JC: RecMotif: a novel fast algorithm for weak motif discovery. BMC Bioinformatics. 2010, 11 (Suppl 11): S8-10.1186/1471-2105-11-S11-S8.
Article PubMed Central PubMed Google Scholar
Boucher C, King J: Fast motif recognition via application of statistical thresholds. BMC Bioinformatics. 2010, 11 (Suppl 1): S11-10.1186/1471-2105-11-S1-S11.
Article PubMed Central PubMed Google Scholar
Tompa M: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005, 23: 137-144. 10.1038/nbt1053.
Article CAS PubMed Google Scholar
Hu J, Li B, Kihara D: Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res. 2005, 33 (15): 4899-4913. 10.1093/nar/gki791.
Article PubMed Central CAS PubMed Google Scholar
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E: Fast unfolding of communities in large networks. J Stat Mech: Theory Exp. 2008, 2008 (10): P10008-10.1088/1742-5468/2008/10/P10008.
Article Google Scholar
Salgado H, Castro SG, Antonio AM, Peredo ED, Solano FS, Gil MP, Alonso DG, Jacinto VJ, Zavaleta AS, Martinez CB, Vides JC: RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res. 2004, 32: D303—D306-
Article PubMed Central PubMed Google Scholar
Chen X, Xu H, et al: Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008, 133: 1106-1117. 10.1016/j.cell.2008.04.043.
Article CAS PubMed Google Scholar
Zia A, Moses AM: Towards a theoretical understanding of false positives in DNA motif finding. BMC Bioinformatics. 2012, 13 (151): 10.1186/1471-2105-13-151.
Li GJ, Liu BQ, Xu Y: A new framework for identifying cis-regulatory motifs in prokaryotes. Nucleic Acids Res. 2011, 39 (7): e42-10.1093/nar/gkq948.
Article PubMed Central CAS PubMed Google Scholar
Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res. 2004, 14: 1188-1190. 10.1101/gr.849004.
Article PubMed Central CAS PubMed Google Scholar
Favorov AV, Gelfand MS, Gerasimova AV, Mironov AA, Makeev VJ: A Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length. Bioinformatics. 2005, 21 (10): 2240-2245. 10.1093/bioinformatics/bti336.
Article CAS PubMed Google Scholar
Bulyk ML, McGuire AM, Masuda N, Church GM: A motif co-occurrence approach for genome-wide prediction of transcription-factor-binding sites in Escherichia coli. Genome Res. 2004, 14: 201-208. 10.1101/gr.1448004.
Article PubMed Central CAS PubMed Google Scholar
Cameron ADS, Redfield RJ: CRP binding and transcription activation at CRP-S sites. J Mol Biol. 2008, 383: 313-323. 10.1016/j.jmb.2008.08.027.
Article CAS PubMed Google Scholar
Gavigan SA, Nguyen T, Nguyen N, Senear F: Role of multiple CytR binding sites on cooperativity, competition, and induction at the Escherichia coli udp promoter. J Biol Chem. 1999, 274 (23): 16010-16019. 10.1074/jbc.274.23.16010.
Article CAS PubMed Google Scholar
Sammitt CE, Hansen FG, Miller JF, Schaechter M: In vivo studies of DnaA binding to the origin of replication of Escherichia coli. EMBO J. 1989, 8 (3): 989-993.
Google Scholar
Campbell JW, Cronan E: Escherichia coli FadR positively regulates transcription of the fabB fatty acid biosynthetic gene. J Bacteriol. 2001, 183 (20): 5982-5990. 10.1128/JB.183.20.5982-5990.2001.
Article PubMed Central CAS PubMed Google Scholar
Robison K, McGuire AM, Church GM: A comprehensive library of DNA-binding site matrices for 55 proteins applied to complete Escherichia coli K-12 genome. J Mol Biol. 1998, 284: 241-254. 10.1006/jmbi.1998.2160.
Article CAS PubMed Google Scholar
Hengen PN, Bartram S, Stewart LE, Schneider TD: Information analysis of Fis binding sites. Nucleic Acids Res. 1997, 25 (24): 4994-5002. 10.1093/nar/25.24.4994.
Article PubMed Central CAS PubMed Google Scholar
Nègre D, Bidaud CB, Geourjon G, Delèage G, Cozzone AJ, Cortay JC: Definition of a consensus DNA-binding site for the Escherichia coli pleiotropic regulatory protein, FruR. Mol Microbiol. 1996, 21 (2): 257-266. 10.1046/j.1365-2958.1996.6341350.x.
Article PubMed Google Scholar
Lavrrar JL, McIntosh MA: Architecture of a Fur binding site: a comparative analysis. J Bacteriol. 2003, 185 (7): 2194-2202. 10.1128/JB.185.7.2194-2202.2003.
Article PubMed Central CAS PubMed Google Scholar
Tramonti A, Canio MD, Biase DD: GadX/GadW-dependent regulation of the Escherichia coli acid fitness island: transcriptional control at the gadY-gadW divergent promoters and identification of four novel 42 bp GadX/GadW-specific binding sites. Mol Microbiol. 2008, 70 (4): 965-982.
CAS PubMed Google Scholar
Cui Y, Wang Q, Stormo GD, Calvo JM: A consensus sequence for binding of Lrp to DNA. J Bacteriol. 1995, 177 (17): 4872-4880.
PubMed Central CAS PubMed Google Scholar
Danot O, Raibaud O: On the puzzling arrangement of the asymmetric MalT-binding sites in the MalT-dependent promoters. Proc Natl Acad Sci USA. 1993, 90: 10999-11003. 10.1073/pnas.90.23.10999.
Article PubMed Central CAS PubMed Google Scholar
Schneiders T, Levy SB: MarA-mediated transcriptional repression of the rob promoter. J Biol Chem. 2006, 281: 10049-10055. 10.1074/jbc.M512097200.
Article CAS PubMed Google Scholar
Plumbridge J: DNA binding sites for the Mlc and NagC proteins: regulation of nagE, encoding the N-acetylglucosamine-specific transporter in Escherichia coli. Nucleic Acids Res. 2001, 29 (2): 506-514. 10.1093/nar/29.2.506.
Article PubMed Central CAS PubMed Google Scholar
McNicholas PM, Rech SA, Gunsalus RP: Characterization of the ModE DNA-binding sites in the control regions of modABCD and moaABCDE of Escherichia coli. Mol Microbiol. 1997, 23 (3): 515-524. 10.1046/j.1365-2958.1997.d01-1864.x.
Article CAS PubMed Google Scholar
Ames GF, Nikaido K: Nitrogen regulation in Salmonella typhimurium. Identification of a ntrC protein-binding site and definition of a consensus binding sequence. EMBO J. 1985, 4 (2): 539-547.
CAS Google Scholar
Cho BK, Federowicz SA, et al: The PurR regulon in Escherichia coli K-12 MG1655. Nucleic Acids Res. 2011, 39 (15): 6456-6464. 10.1093/nar/gkr307.
Article PubMed Central CAS PubMed Google Scholar
Zhang Y, et al: Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008, 9: R137-10.1186/gb-2008-9-9-r137.
Article PubMed Central PubMed Google Scholar
Georgiev S, Boyle AP, Jayasurya K, Ding X, Mukherjee S, Ohler U: Evidence-ranked motif identification. Genome Biol. 2010, 11 (2): R19-10.1186/gb-2010-11-2-r19.
Article PubMed Central PubMed Google Scholar
Sharov AA, Ko MSH: Exhaustive search for over-represented DNA sequence motif with CisFinder. DNA Res. 2009, 16: 261-273. 10.1093/dnares/dsp014.
Article PubMed Central CAS PubMed Google Scholar
Girvan M, Newman MJ: Community structure in social and biological network. Proc Nat Acad Sci. 2002, 99: 7821-7826. 10.1073/pnas.122653799.
Article PubMed Central CAS PubMed Google Scholar
Fortunato S: Community detection in graphs. Phys Rep. 2010, 486: 75-174. 10.1016/j.physrep.2009.11.002.
Article Google Scholar
Rosvall M, Bergstrom CT: Maps of random walks on complex networks reveal community structures. Proc Nat Acad Sci. 2008, 105: 1118-1123. 10.1073/pnas.0706851105.
Article PubMed Central CAS PubMed Google Scholar
Raghavan UN, Albert R, Kumara S: Near linear time algorithm to detect community structure in large-scale networks. Phys Rev E. 2007, 76: 036106-
Article Google Scholar
Ovelgonne M, Schulz AG, Stein M: Randomized greedy modularity optimization for group detection in huge social networks. Proceedings of the fourth SNA-KDD Workshop, KDD 2010, July 25. 2010, Washington D.C USA, 1-9.
Google Scholar
Newman MJ, Girvan M: Finding and evaluating community structure in networks. Phys Rev E. 2004, 69: 026113-
Article CAS Google Scholar

Download references

Acknowledgements

This work was supported in part by the National Nature Science Foundation of China (Grant No. 60905029, 61105055, and 81230086), the Beijing Natural Science Foundation (Grant No. 4112046), and the Fundamental Research Funds for the Central Universities. The authors would like to thank Dr. H. Q. Sun for providing the source code for RecMotif.

Author information

Authors and Affiliations

School of Computer and Information Technology, Beijing Jiaotong University, Beijing, 100044, China
Caiyan Jia & Jian Yu
Department of Bioengineering/Bioinformatics, University of Illinois at Chicago, Chicago, IL, 60612, USA
Caiyan Jia & Matthew B Carson

Authors

Caiyan Jia
View author publications
You can also search for this author in PubMed Google Scholar
Matthew B Carson
View author publications
You can also search for this author in PubMed Google Scholar
Jian Yu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Caiyan Jia.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

CJ initiated the project, analyzed the data, and drafted the manuscript; MC and JY participated in the study design, discussion, and editing of the manuscript. All authors read and approved the final manuscript.

Electronic supplementary material

Additional file 1: The C++ Version of TFBSGroup (for WindowsXP).(RAR 4 MB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Jia, C., Carson, M.B. & Yu, J. A fast weak motif-finding algorithm based on community detection in graphs. BMC Bioinformatics 14, 227 (2013). https://doi.org/10.1186/1471-2105-14-227

Download citation

Received: 22 October 2012
Accepted: 12 July 2013
Published: 17 July 2013
DOI: https://doi.org/10.1186/1471-2105-14-227

A fast weak motif-finding algorithm based on community detection in graphs

Abstract

Background

Results

Conclusions

Background

Results

Benchmark data sets

Comparisons between TFBSGroup and state-of-art algorithms using (l, d) samples

Mining for transcription factor binding sites in Escherichia coliK-12

Motif discovery in 12 mouse ES CELL ChIP-seq data sets

Conclusions and discussion

Methods

(l, d) motif search and dense subgraph extraction

Community detection and dense subgraph identification

BGLL: a near linear time algorithm for community detection

TFBSGroup: a fast motif-finding algorithm

Time and space complexity of TFBSGroup

The choice of x

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ contributions

Electronic supplementary material

Additional file 1: The C++ Version of TFBSGroup (for WindowsXP).(RAR 4 MB)

Authors’ original submitted files for images

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us