LASAGNA: A novel algorithm for transcription factor binding site alignment

Lee, Chih; Huang, Chun-Hsi

doi:10.1186/1471-2105-14-108

Methodology article
Open access
Published: 24 March 2013

LASAGNA: A novel algorithm for transcription factor binding site alignment

Chih Lee¹ &
Chun-Hsi Huang¹

BMC Bioinformatics volume 14, Article number: 108 (2013) Cite this article

9909 Accesses
21 Citations
5 Altmetric
Metrics details

Abstract

Background

Scientists routinely scan DNA sequences for transcription factor (TF) bindingsites (TFBSs). Most of the available tools rely on position-specific scoringmatrices (PSSMs) constructed from aligned binding sites. Because of theresolutions of assays used to obtain TFBSs, databases such as TRANSFAC,ORegAnno and PAZAR store unaligned variable-length DNA segments containingbinding sites of a TF. These DNA segments need to be aligned to build aPSSM. While the TRANSFAC database provides scoring matrices for TFs, nearly78% of the TFs in the public release do not have matrices available. As workon TFBS alignment algorithms has been limited, it is highly desirable tohave an alignment algorithm tailored to TFBSs.

Results

We designed a novel algorithm named LASAGNA, which is aware of the lengths ofinput TFBSs and utilizes position dependence. Results on 189 TFs of 5species in the TRANSFAC database showed that our method significantlyoutperformed ClustalW2 and MEME. We further compared a PSSM method dependenton LASAGNA to an alignment-free TFBS search method. Results on 89 TFs whosebinding sites can be located in genomes showed that our method issignificantly more precise at fixed recall rates. Finally, we describedLASAGNA-ChIP, a more sophisticated version for ChIP (Chromatinimmunoprecipitation) experiments. Under the one-per-sequence model, itshowed comparable performance with MEME in discovering motifs in ChIP-seqpeak sequences.

Conclusions

We conclude that the LASAGNA algorithm is simple and effective in aligningvariable-length binding sites. It has been integrated into a user-friendlywebtool for TFBS search and visualization called LASAGNA-Search. The toolcurrently stores precomputed PSSM models for 189 TFs and 133 TFs built fromTFBSs in the TRANSFAC Public database (release 7.0) and the ORegAnnodatabase (08Nov10 dump), respectively. The webtool is available athttp://biogrid.engr.uconn.edu/lasagna_search/.

Background

A transcription factor is a protein that regulates the expression of its target genesby physically binding to the promoter regions of these genes. The binding sites of atranscription factor (TF) naturally share similarity with each other. The commonpattern shared among the binding sites of a TF is called a motif. In general, thereare two approaches to computational motif analysis, de novo motif discovery [1-12] and transcription factor binding site (TFBS) search [13-17]. As the name suggests, de novo motif discovery algorithms findover-represented patterns in sequences without prior knowledge of the binding TFs.The input to these algorithms is usually the upstream region sequences of genesputatively co-regulated by one or more common TFs. The output is one or more motifsor patterns whose instances are over-represented in the input sequences. On theother hand, a TFBS search algorithm takes binding site sequences of a TF as input.It learns from these known binding sites and builds a TF model out of them. The TFmodel can then be used to scan sequences for putative binding sites. While the twoapproaches are tightly connected, we focus on the TFBS search problem and assumethat a TF has known binding sites available.

A typical TFBS search algorithm requires aligned TFBSs. This requirement allowssimple representations of TF models. Types of TF models include consensus sequences,position-specific scoring matrices (PSSMs) [18], etc. The PSSM method is a widely used method among the available TFBSsearch algorithms. Given aligned binding sites of a TF, the TF model is essentiallya 4×l matrix, where l is the length of the binding sites.Column i of the matrix stores the scores of matching the i^th letter in a sequence of length l (an l-mer) tonucleotides A, C, G and T, respectively. The score of an l-mer is thencalculated by summing up the scores of letter 1 through letter l. Dependingon the variant of PSSM, the score of A at position i can be the count of Aat position i in the known TFBSs, the log-transformed probability ofobserving A at position i, or any other reasonable number. Onceconstructed, the matrix of a TF can be stored in a database to scan sequences forbinding sites of the TF in the future without resorting to the actual binding sites.In fact, many tools [14, 19-24] depend on matrices stored in at least one of the databases, JASPAR [25], RegulonDB [26] and TRANSFAC [27]. Since a matrix is constructed from aligned binding sites, we cannotoveremphasize the quality of TFBS alignments.

Databases such as JASPAR, TRANSFAC and ORegAnno [28] contain DNA segments bound by TFs. These DNA segments can be seen asTFBSs with some irrelevant bases on one or both sides because of the resolutions oftechniques used to obtain TFBSs. The DNA segments belonging to a TF are thereforeunaligned variable-length sequences. While the DNA segments for most TFs in theJASPAR database are aligned, this is not the case for the TRANSFAC public andORegAnno databases. About 53% (983 out of 1867) of the TFs in the TRANSFAC Publicdatabase (release 7.0) have unaligned variable-length DNA segments. Moreover, nearly78% (1447 out of 1867) of TFs having curated DNA segments do not have a matrix.Focusing on TFs with variable-length DNA segments, about 71% (669 out of 983) ofthem do not have a matrix. On the other hand, the ORegAnno database storesexperimentally validated DNA segments bound by TFs but does not provide matrices.About 31% (175 out of 572) of the TFs therein have variable-length DNA segments. Inthe absence of a matrix, to search for binding sites of these TFs using a matrixdependent tool, one needs to first align the curated DNA segments for each TF. Inthe rest of this paper, we refer to (variable-length) DNA segments containing TFBSsas (variable-length) TFBSs for simplicity reasons.

In this work, we propose a novel TFBS alignment algorithm named LASAGNA (Length-AwareSite Alignment Guided by Nucleotide Association). The algorithm is based on thehypothesis that binding sites of a TF share a core [29], a short and highly conserved stretch of DNA. Hence, a binding site canbe seen as a core with some irrelevant bases on one or both sides. In general,shorter sites tend to contain fewer irrelevant bases and are easier to align. Forthis reason, we progressively align the binding sites from the shortest to thelongest ones. The algorithm further exploits dependence between two positions in abinding site. Dependence between positions has been shown to boost performance ofTFBS search algorithms [13, 16] as well as protein structural motif recognition [30]. To our best knowledge, this idea has never been applied to multiplesequence alignment. We further describe a more sophisticated version, namedLASAGNA-ChIP, for aligning peak sequences produced by ChIP-seq experiments.

To compare algorithms for TFBS alignment, we conduct cross-validation (CV)experiments on 4771 binding sites of 189 TFs across 5 species extracted from theTRANSFAC Public database (release 7.0). We compare LASAGNA to ClustalW2 [31, 32] and MEME [1]. Being a widely used multiple sequence alignment algorithm, ClustalW2 wasused to produce gapped TFBS alignments in creating the MAPPER database [33] as well as to produce both gapped and gapless TFBS alignments in [16]. ClustalW2 and other similar algorithms focus on producing structurallycorrect alignments, while other improved algorithms rely on structural or homologyinformation [34]. ClustalW2 can be viewed as a representative of these algorithms when noinformation other than sequences is available. MEME, on the other hand, is a denovo motif discovery tool, whose input is typically regulatory regions oflength 1,000 bp upstream of the genes presumably controlled by a common TF [35]. Nevertheless, a motif found in the input TFBSs can be used to align theTFBSs. In fact, MEME is employed by the PAZAR database [36] to dynamically align TFBSs and generate PSSMs. We show that LASAGNAsignificantly outperforms ClustalW2 (p-value:1.22×10⁻¹⁵) and MEME (p-value:3.55×10⁻¹⁵).

To scan promoters for new TFBSs based on variable-length known TFBSs, we couple aPSSM method with LASAGNA, denoted by LASAGNA-PSSM. That is, the inputvariable-length TFBSs are aligned by LASAGNA and a PSSM model is built from thealignment. It is useful to compare an alignment-based TFBS search method to analignment-free method. Therefore, we further compare LASAGNA-PSSM to SiTaR [17], which accepts variable-length input TFBSs. To our best knowledge, SiTaRis the only alignment-free method capable of handling variable-length input TFBSs atthe time of writing. Cross-validation results on 90 TFs whose binding sites can belocated in respective genomes indicate that LASAGNA-PSSM is significantly moreprecise at fixed recall rates (p-value: 2.66×10⁻⁸).The recall-precision curve also shows that our method is constantly more precise atany recall rate and more sensitive at any precision.

Finally, we demonstrate the application of LASAGNA-ChIP to ChIP-seq data using 38mouse ChIP-seq experiments. We show that, assuming the one-per-sequence model,LASAGNA-ChIP is comparable to MEME in revealing the motif of the ChIPed TF or itscofactor. For both LASAGNA-ChIP and MEME, the ChIPed TF motif was found in 31experiments, while a cofactor motif was found in 3 experiments. While the twomethods differ in the rest 4 experiments, the found motifs have similar informationcontent and may belong to unknown cofactors.

Methods

We describe our novel alignment algorithm in this section. LASAGNA utilizes a searchmodule to align a new binding site with a partial alignment. Thus, we introduce thesearch module followed by the LASAGNA algorithm.

The search module

The search module of LASAGNA is a function learned from a (partial) TFBSalignment to score l-mers. It considers nucleotide pairs in addition toindividual nucleotides so as to exploit dependence between positions. Weintroduce our choice of the search module, the PSSM model described in [13]. We denote it by ${PSSM}_{K}^{a} (\cdot)$ in this work.

Suppose that a PSSM is constructed from aligned sequences of length l.The score of letter u at position i is given by

M_{i} (u) = log \frac{f_{i} (u)}{f (u)},

where f_i(u) is the probability of observing letter u at positioni and f(u) is the background probability ofseeing letter u. Similarly, the score of a pair of letters(u,v) at position (i,j) is given by

M_{i, j} (u, v) = log \frac{f_{i, j} (u, v)}{f (u, v)},

where f_i,j(u,v) is the probability of observing nucleotide pair(u,v) at position (i,j) andf(u,v) is the background probability of seeingthe pair. The score of s, a sequence of length l, is then

{PSSM}_{K} (s) = \sum_{i = 1}^{l} M_{i} (s_{i}) + \sum_{k = 1}^{K} \sum_{i = 1}^{l - k} M_{i, j} (s_{i}, s_{j}),

(1)

where s_i denotes the i^th letter of s, j=i+k andK is the scope parameter defined in [13]. The parameter K controls how far apart a pair ofnucleotides can be. When K=1, only adjacent nucleotide pairs arescored. We define ${PSSM}_{0} (s) = \sum_{i = 1}^{l} M_{i} (s_{i})$ , that is, we do not score nucleotide pairs whenK=0.

Our search module is a variant of (1). Let

M_{i}^{'} (u) = \{\begin{array}{l} min_{x} M_{i} (x) & if u is the gap letter \\ M_{i} (u) & otherwise \end{array}

and

M_{i, j}^{'} (u, v) = \{\begin{array}{l} min_{x, y} M_{i, j} (x, y) & if u or v is the gap letter \\ M_{i, j} (u, v) & otherwise \end{array} .

The search module is defined as follows:

{PSSM}_{K}^{a} (s) = \sum_{i = 1}^{l} M_{i}^{'} (s_{i}) + \sum_{k = 1}^{K} \sum_{i = 1}^{l - k} M_{i, j}^{'} (s_{i}, s_{j}),

(2)

where superscript a denotes alignment as this module is used in our alignmentalgorithm.

The LASAGNA algorithm

The algorithm is based on the idea that the binding sites of a TF share a commoncore, a conserved short DNA sequence. A binding site can then be seen as a corewith a few irrelevant bases on one or both sides. Assuming that each bindingsite fully contains the core, the shorter a binding site, the fewer irrelevantbases it contains. Therefore, we progressively align the binding sites byaligning the shortest binding site with the already aligned binding sites untilall the binding sites are aligned.

The algorithm takes a set of unaligned binding sites, U, and parameterK_a as inputs. Let A denote the set of aligned binding sites.A binding site in A may have gap letters added to one or both ends as aresult of the alignment. The algorithm works as follows:

1.
Initialize A to {s}, where s, the seed site, a shortest binding site arbitrarily chosen from U. Remove s from U.
2.
(a) Build ${PSSM}_{K_{a}}^{a} (\cdot)$ from A. Let the length of this PSSM be l.
1. (b)
  Remove the shortest binding site s from U.
2. (c)
  Create S, the augmented sequence of s, by adding l−1 gap letters to both ends of s.
3. (d)
  Score each l-mer of S by ${PSSM}_{K_{a}}^{a} (\cdot)$ to find the highest scoring one.
4. (e)
  Let s be its reverse-complement and repeat c-d. That is, the opposite strand is considered.
3.
Add s to A if the highest scoring l-mer resides in s. Otherwise, add its reverse-complement to A. Gap letters are added to one or both ends of sequences in A. This ensures that they are all of the same length and each column of the alignment has at least one non-gap letter.
4.
Repeat 2-3 until U is empty.

In step 2b, there may be more than one shortest binding sites in U. Tobreak the tie, we use ${PSSM}_{K_{a}}^{a} (\cdot)$ to scan each of the shortest ones. The“s” containing the highest scoring l-mer isremoved from U to align with sequences in A. In the unlikelycase of two or more shortest binding sites in U sharing the samehighest score, one is arbitrarily chosen. Figure 1illustrates an iteration of the algorithm.

An alignment may be trimmed before building a PSSM. We describe one way oftrimming aligned TF binding sites using two simple measures. Let l bethe length of the aligned binding sites. We first compute and denote thepercentage of non-gap letters at position i of the alignment byC_i, for i=1,2,…,l. The information content (IC) ateach position is then computed with small sample correction described in [37]. That is,

I C_{i} = max \{0, 2 + \sum_{u \in {A, C, G, T}} f_{i} (u) \underset{2}{log} f_{i} (u) - ê (n_{i})\},

where i∈{1,2,…,l}, n_i is the number of non-gap letters at position i and $ê (\cdot)$ gives the approximated sampling error. Let C_min and I C_min be the cutoff thresholds. The alignment is examined from the leftend to the right until the first position j satisfying both C_j>C_min and I C_j>I C_min is encountered. The positions preceding j are trimmedoff. The trimming is similarly applied to the right end.

LASAGNA for ChIP-seq data

Although LASAGNA is not specifically designed as a de novo motifdiscovery algorithm, a more sophisticated version, named LASAGNA-ChIP, iscapable of handling ChIP-seq data. Here, we refer to the previous section anddescribe the additional steps that are necessary for aligning ChIP-seq peaksequences. The flowchart in Additional file 1 gives anoverview of LASAGNA-ChIP.

Before aligning ChIP-seq peak sequences, each sequence is clipped to 100 bpsurrounding the signal peak. This is a common practice since, for most peaksequences (>90%), the actual TFBS is usually found within 50 bp of the calledpeak [38]. In step 2a, we trim the partial alignment A if it containsmore than two sequences. Unlike TFBSs found in databases such as TRANSFAC, evenafter clipping, a peak sequence contains much more irrelevant bases flanking thecore. The trimming procedure described in the previous section is used, whereC_min (I C_min) is set to the mean C_i (I C_i) over all the columns of A. The resulting alignment is furthertrimmed by IC such that it has at most 15 columns and the columns on both endshave positive IC. In step 2b, if there are more than 5 shortest binding sites inU. Only 5 are arbitrarily chosen to break the tie by similarity to ${PSSM}_{K_{a}}^{a} (\cdot)$ .

The alignment A obtained by the modified procedure is further refined asfollows:

1.
Set T to A trimmed to l columns as described above.
2.
Build ${PSSM}_{K_{a}}^{a} (\cdot)$ out of T.
3.
Initialize R to {}, the refined partial alignment.
4.
For each peak sequence s,
1. (a)
  Create S, the augmented sequence of s, by adding l−1 gap letters to both ends of s.
2. (b)
  Score each l-mer of S by ${PSSM}_{K_{a}}^{a} (\cdot)$ to find the highest scoring one.
3. (c)
  Let s be its reverse-complement and repeat a-b.
4. (d)
  Add s to R if the highest scoring l-mer resides in s. Otherwise, add its reverse-complement to R. Gap letters are added to one or both ends of sequences in R.
5.
Set A to R and repeat 1-5 until the sum of IC across columns of T does not change in 3 iterations.

For ChIP-seq peak sequences, the shortest sequence may miss or contain only afraction of the core. Hence, using the shortest sequence as the seedsite sometimes results in an alignment with less IC. For this reason,five additional sequences are arbitrarily chosen as the seed site toproduce 5 additional alignments. Among the 6 alignments, the one with the mostIC after trimming is chosen as the final alignment.

Scoring a putative binding site

Although a PSSM suggests the length of a putative binding site, we do notrestrict the length of a candidate binding site to the length of the PSSM. Aputative binding site could be of any reasonable length. If a true binding siteis flanked by a few irrelevant bases, this sequence should be given a relativelyhigh score compared to those of non-binding sites. Therefore, to score aputative binding site s, we slide s through the PSSM asdescribed in the previous section. The score of sequence s is given by

{Score}_{K_{s}} (s) = max_{i \in {1, 2, \dots, l + l_{s} - 1}} {PSSM}_{K_{s}} (S_{i : (i + l - 1)}),

(3)

where l is the length of the PSSM, l_s is the length of s, S denotes the augmented sequence ofs with l−1 gap letters on both ends and ${PSSM}_{K_{s}} (\cdot)$ is defined in (1).

Results and discussion

Comparison of alignment algorithms

Data sets

We downloaded all the TF binding sites from the TRANSFAC Public database(release 7.0). The binding sites were grouped by species and TF. Bindingsites having less than 4 nucleotides were discarded. TFs of each specieswere filtered such that each TF has at least 10 binding sites. This ensuresthat each TF has enough binding sites to construct a PSSM. The numbers ofTFs and TFBSs are listed in Table 1.

Table 1 TFBSs in TRANSFAC public database by species

Full size table

To facilitate experiments, we planted each TFBS in a 2000 base randomsequence simulated by a first-order Markov chain of the species in question.Except for Saccharomyces cerevisiae, the Markov chain of a species waslearned from promoter sequences in the UCSC Genome Browser database [39]. For Saccharomyces cerevisiae, the promoter sequences wereretrieved from the SCPD [40] using the yeast gene list in euGenes [41].

Performance assessment and evaluation metrics

Since the purpose of aligning TFBSs is to construct a PSSM, the quality of analignment is best measured by the search performance of the PSSM. Theperformance of a TFBS search method is evaluated by ν-fold CV.Consider a TF with n binding sites. The n TFBSs are firstdivided into ν sets, each of which contains $⌊ \frac{n}{ν} ⌋$ or $⌊ \frac{n}{ν} ⌋$ + 1 TFBSs. At each iteration of the ν-foldCV, one of the ν TFBS sets called the test TFBS set,P_test, is left out. The rest of the TFBSs are aligned to build aPSSM. Each test TFBS in P_test is then planted in a 2000 base random sequence and scannedby the PSSM, scoring each l-mer, where l is the length ofthe test TFBS. We score both the forward and reverse strands of anl-mer and assign the higher score to it. An l-mer isconsidered a hit if it shares more than ⌊l/2⌋ baseswith the test TFBS. The l-mers can then be divided into two sets,H and N, where H is the set of hits andN is considered the set of non-binding sites. The score of thetest TFBS is the highest score of hits in H. For each test TFBSt∈P_test, we find its rank relative to all the non-binding sites inN. Formally, the rank of binding site t equals $1 + | {s \in N | {Score}_{K_{s}} (s) \geq {Score}_{K_{s}} (t)} |$ .

After the ν-fold CV, we end up with n ranks, each ofwhich corresponds to a TFBS. We use the area under the ROC curve (AUC) togauge the quality of alignment. The ROC curve is a plot of true positiverate (TPR) against false positive rate (FPR), displaying the trade-offbetween TPR and FPR. We refer readers to [42] for an introduction to this metric. In this study,ν=10 for all the CV experiments.

Comparison with ClustalW2

In general, gapless alignment is preferred over gapped alignment for aligningTFBSs. Because of the nature of ClustalW2, the alignment of TFBSs maycontain gaps in the middle of some binding sites. This is disadvantageous toClustalW2 as the PSSM method does not allow insertion of gaps into thesequence being scanned. Hence, we turned off gaps by setting the gap openingpenalty parameters to a large value, i.e., we set bothGAPOPEN and PWGAPOPEN to100000. Indeed, results indicated that overall the “gapped”ClustalW2 performs slightly worse than the “gapless” variant(p-value: 0.277). For both LASAGNA and ClustalW2, parameterK_s in Eq. 3 was searched from 0 to min{10,l_min−1} for each TF and the one producing the highest AUC isused, where l_min is the minimal length of the TFBSs. For LASAGNA, parameterK_a of the LASAGNA algorithm was set to K_s as the two parameters are closely related.

We conducted 10-fold CV on each TF. The overall ROC curves are shown inFigure 2. The ROC curves are based on the ranks of4771 TFBSs of 189 TFs. It shows that LASAGNA has invariably higher truepositive rate than ClustalW2. The AUC score was calculated for each TF andfor each method. To gauge the significance of difference, the Wilcoxonsigned-rank test [43] was performed for each species. The tests showed that LASAGNA isconsistently better than ClustalW2 across the 5 species. Table 2 shows the test results. Overall, LASAGNA performedsignificantly better than ClustalW2 in terms of AUC scores. The species-wisep-values shows that LASAGNA is significantly better (<0.05)than ClustalW2 for aligning TFBSs of all the 5 individual species.

Table 2 Species-wise and overall comparisons between LASAGNA andClustalW2

Full size table

To better understand the results, we split the 189 TFs into two groups. Onecontains TFs on which LASAGNA performed better than ClustalW2 and the othercontains the rest of the TFs. Three factors are examined for each TF. Theyare the number of TFBSs, the mean and standard deviation of TFBS length. Foreach factor, we looked for difference between the two groups. Table 3shows the comparisons. It can be seen that LASAGNAproduces better alignments when a TF has fewer binding sites but thedifference is not significant. The mean and standard deviation of TFBSlength are the two more important factors. We believe that LASAGNA iswell-suited for aligning TFBSs that are longer and more variable in length.

Comparison with MEME

The MEME tool in the MEME Suite 4.8.1 was used. The parameterminw, minimal width of motifs, was set to thesmaller of 6 and the minimal length of input TFBSs. The optionrevcomp to search the reverse strand was turnedon. Finally, the parameter minsites was set to thenumber of input TFBSs since a common motif is supposed to appear at leastonce in each TFBS. To ensure that MEME functions properly, binding sitesshorter than 8 bases are padded with gap letters since genomic locations arenot available for most TFBSs.

Table 3 Comparison of two groups of TFs divided according to results onLASAGNA and ClustalW2

Full size table

The experiments were carried out in the same manner as the ClustalW2experiments. The overall ROC curve in Figure 2indicates that LASAGNA has consistently higher true positive rates than MEMEacross different false positive rates. The overall and species-wisecomparisons between LASAGNA and MEME in Table 4 showthat LASAGNA performed significantly better than MEME. To gain some insightsinto the difference between LASAGNA and MEME, we similarly examined thethree factors used to compare LASAGNA and ClustalW2. As seen in Table 5, the number of input TFBSs is the only significant(p-value < 0.05) factor out of the three. The reasons arenot clear but may be investigated in the future. Moreover, it will behelpful to identify other (biologically meaningful) factors that can betterexplain the performance difference.

Table 4 Species-wise and overall comparisons between LASAGNA and MEME

Full size table

Table 5 Comparison of two groups of TFs divided according to results onLASAGNA and MEME

Full size table

Distribution of K_s

In Additional file 2, for LASAGNA, ClustalW2 andMEME, we show the distribution of K_s for a TF by species and conserved domain. Overall, we observethat small values are preferred for all three methods. By visual inspection,LASAGNA appears more similar to MEME than ClustalW2 in the usage ofK_s. It can be seen that the usage of K_s differs among different conserved domains. Related conserveddomains such as ZF-H2C2_2 and ZF-C2H2 display similar patterns. This is notsurprising as conserved domains in a protein are often computationallypredicted. Hence, a protein is likely to possess related conserved domains.While overall the distributions seem method-dependent, we observe that, forZF-H2C2_2 and ZF-C2H2, the distributions center around 4 across all threemethods. Finally, we note that these observations are preliminary and moreTFs are needed to draw statistically sound conclusions.

Comparison of TFBS search methods

Data sets

To compare with an alignment-free TFBS search method, SiTaR, [17], we retrieved real promoter sequences embedding TFBSs.Specifically, we followed the curated location of each binding site in theTRANSFAC Public database (release 7.0) to retrieve the 1000-base sequencesflanking the binding site. We discarded binding sites that cannot be foundin the proximity of the curated locations. The retrieved binding sites weregrouped by TF and TFs having less than 10 binding sites were removed. Afterfiltering, we ended up with 90 TFs and 1751 binding sites. A TF may bepresent in more than one species as homologs and hence the binding sites ofa TF may be located in genomes of multiple species. The species andrespective numbers of binding sites are shown in Table 6.

Table 6 Distribution of the 1751 binding sites of 90 TFs in TRANSFACpublic database

Full size table

Performance assessment and evaluation metrics

To compare with SiTaR [17], we adopt the same ν-fold CV process used tocompare LASAGNA with ClustalW2 and MEME. However, we do not assume that aTFBS search method scores all the l-mers in a promoter sequence,where l is the length of binding sites. Instead, a TFBS searchmethod scans a promoter sequence and predicts a list of binding sites withrespective scores. The predicted binding sites may be of different lengths,which is the case for SiTaR.

We describe how a hit is determined. Let the length of a predicted bindingsite be l and the length of the test TFBS in question be l_s. The predicted binding site is considered a hit to the testTFBS if the overlap between the two sequences is more than ⌊l_s/2⌋ bases as in [17]. In case this is not possible, i.e.,l≤⌊l_s/2⌋, the predicted binding site must be embedded in thetrue one to be deemed a hit.

Using the n ranks of TFBSs from ν-fold CV, we computerecall (true positive rate), precision and the F_β-measure, where β=0.5 as in [17]. Let the recall rate be r. The number of TFBSs recalledby the method is p_T=n×r. Let the number of non-binding sitesor false positives introduced be p_F. The precision is given by $\frac{p_{T}}{p_{T} + p_{F}}$ , while $F_{β} = (1 + β^{2}) \frac{precision \times recall}{β^{2} \times precision + recall}$ .

Comparison with an alignment-free method

We conducted 10-fold CV on the aforementioned 90 TFs. The PSSM methoddependent on LASAGNA (LASAGNA-PSSM) was compared to SiTaR [17]. LASAGNA considered both strands of a sequence when aligningbinding sites. The parameters K_a=K_s were determined in the same way as in comparing LASAGNA toClustalW2. An alignment was trimmed with C_min=0.4 and I C_min=0 before constructing a PSSM as described in the methodsection on the LASAGNA algorithm. The PSSM method uses a cutoff score topredict TFBSs. The cutoff score is set to the minimal score of theconstituting binding sites of the PSSM. The SiTaR method has a mismatchparameter and the maximal value allowed by its webtool is 5. We searched inthe range from 0 to 5 to find the mismatch value giving the highestF_β-measure for each TF.

In terms of the F _β, no significant difference was found between the two methods(p-value: 0.392 [43]). To ensure a fair comparison, we fixed the recall rate for eachTF and compare the precision achieved by LASAGNA-PSSM and SiTaR. The recallrate was set to the lower of the recall rates attained by LASAGNA-PSSM andSiTaR. The TF c-Jun (AC: T00132) was excluded from comparison because SiTaRdid not recover any TFBS. Figure 3a shows the plot ofprecision by LASAGNA-PSSM against that by SiTaR. At fixed recall rates,LASAGNA-PSSM is more precise than SiTaR on 65 out of 89 TFs(p-value: 2.66×10⁻⁸). Figure 3b shows the plots of precision against recall based on all therecalled TFBSs by each method. It can be seen that LASAGNA-PSSM isconstantly more precise than SiTaR at the same recall rate. Moreover,LASAGNA-PSSM recovered substantially more TFBSs than SiTaR at the sameprecision.

Results reported in [17] showed that SiTaR is highly precise and sensitive. Although SiTaRaccepts variable-length binding sites, all the experiments presented in [17] used fixed-length binding sites as inputs. It is therefore notclear how SiTaR performs on TFs having variable-length binding sites. It isalso not clear whether SiTaR preprocesses highly variable-length bindingsites as this was not stated in [17]. These issues however are not the focus of our work.

Application of LASAGNA-ChIP to ChIP-seq data

To demonstrate the use of LASAGNA-ChIP on ChIP-seq data, we retrieved mouseChIP-seq data produced by the Encyclopedia of DNA Elements (ENCODE) project [44] from the UCSC Genome Browser [39]. All the 38 peak files in the Narrow Peaks format that matchespatternhttp://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/wgEncode*Tfbs*Pk*weredownloaded on Oct. 12, 2012, where “*” is the wildcard charactermatching zero or more characters. These files give signal peak location besidesstart and end for each peak and hence the corresponding signal files do not needto be processed by a peak-finding algorithm. Four distinct cell types and 17distinct target TFs are present in the 38 ChIP-seq experiments. Additional file3 lists, for each ChIP-seq experiment, the cell,target TF, number of peaks as well as the minimum, maximum, mean and standarddeviation of peak length. We observe that the peak length varies greatly. Themean peak length can be as long as 1124, while the highest standard deviation isnearly 876.

It is useful to know if LASAGNA-ChIP is able to align peak sequences and revealthe motif of the ChIPed TF. To align peak sequences, parameter K_a was searched from 0 to 8 to obtain the alignment with the highestIC. MEME was also used to align peak sequences because it is often the choice ofmethod. In fact, MEME is used by 5 out of 6 tools compared in [45] for ChIP-seq data analysis. The MEME parameters are described insection Comparison of alignment algorithms, where the one-per-sequence model isassumed. To ensure that both methods finish within reasonable time, for eachexperiment, we randomly sampled 300 peaks for alignment. We did not distinguishlarge peaks from small ones because ChIP-seq experiments require large numbersof cells and hence “a small peak could represent very strong binding inonly a subset of the cells” [46].

For each alignment, we searched for the resulting motif in 386 UniPROBE mousemotifs and 398 motifs derived from all the matrices in the TRANSFAC Publicdatabase. The search was accomplished by software TOMTOM [47]. We used Pearson correlation as the distance measure, required aminimal overlap of 5 nucleotides, and set the E-value cut-off to 5. Additionalfile 4 shows, for each ChIP-seq experiment, thesequence logos of motifs found by LASAGNA-ChIP and MEME. The matching motifsfound by TOMTOM are listed under each sequence logo [48] by E-value. In case more than 10 significant motifs were found, onlythe 10 most significant ones were shown. The one matching the ChIPed TF ishighlighted in yellow for each ChIP-seq experiment.

We first notice that overall the motifs found by LASAGNA-ChIP and MEME are verysimilar by visual inspection. No significant difference is observed in terms ofmotif IC (p-value: 0.1252). For both LASAGNA-ChIP and MEME, the ChIPedTF motifs were found for 31 experiments. Among the other 7 experiments are oneMYB in MEL cells, all the ETS1 in CH12 and MEL cells, one JUND in MEL cells, oneMAX in C2C12 cells, all the TBP in CH12 and MEL cells. Interestingly,LASAGNA-ChIP and MEME differ only for 4 out of these 7 experiments. They are oneETS1 in CH12 cells, one MAX in C2C12 cells and two TBP in CH12 and MEL cells.Although LASAGNA-ChIP and MEME differ in these cases, the found motifs stillwarrant further analyses. For instance, the motif for ETS1 in CH12 cells foundby LASAGNA-ChIP resembles the secondary motif of Gabpa, which is a knownparalog.

For the other 3 out of the 7 experiments, LASAGNA-ChIP and MEME produced similarmotifs. The one found for MYB in MEL resembles those of GATA proteins. Thisagrees with a recent study reporting that MYB and GATA-3 cooperatively regulateIL-13 by direct binding to a conserved GATA-3 response element [49]. Since this motif is based on 300 peak sequences, it is likely thatthe two proteins similarly regulate other genes in MEL cells. The motif for ETS1in MEL cells also matches those of GATA proteins. Cooperation between ETS1 andGATA-3 in regulating IL-5 was also suggested [50, 51]. Finally, while the motif for JUND in MEL cells matches two motifs inthe TRANSFAC and UniPROBE databases, the matches are likely false positivessince no literature support was found.

While it is not specifically designed to be a de novo motif discoverymethod, LASAGNA-ChIP aligns all the peak sequences and finds the mostinformative motif. The assumption that a motif instance is present in each peaksequence may not hold for some experiments. Because of several possible bindingmodels [46], two or more motifs may be present in subsets of the peak sequences.Discovery of more than one motif will be enabled for LASAGNA-ChIP in the nearfuture.

LASAGNA is simple and effective

Unlike MEME and similar methods, the order in which the input sequences arealigned is crucial to LASAGNA and ClustalW. ClustalW relies on a guide treebased on pairwise alignments to decide the order. LASAGNA, on the other hand,depends on the length of a sequence and its similarity to the partial alignment.LASAGNA-ChIP is well-suited for a TF whose shortest site misses the core orcontains only a fraction of it. We, however, observed no significant differencebetween LASAGNA and LASAGNA-ChIP on TFBSs in the TRANSFAC Public database. Thisis because, for these TFBSs, a shortest site often fully contains thecore.Hence, our assumption holds true in general.

For ChIP-seq data, the assumption that short sequences contain less irrelevantbases flanking the core may not hold. However, we observe that, under theone-per-sequence model, LASAGNA-ChIP performed comparably well to MEME inaligning ChIP-seq peak sequences. We attempted other orders such as from thelongest sequence to the shortest one and found that aligning the shortestsequence first does have its advantage (data not shown). Also, we note that, for11 out of 38 experiments, the peak sequences are all at least 100 bp (seeAdditional file 3) and hence all the peak sequencesare 100 bp long after clipping. This implies that LASAGNA-ChIP is capable ofhandling sequences of the same length.

LASAGNA-ChIP, MEME and methods alike produce gapless alignments and do have theirlimits. When a TF binds to two cores separated by a variable-length spacer,these methods are expected to align the canonical TFBSs containing spacers ofthe most prevalent length. These binding patterns are also known as two-blockmotifs. Gapped alignment or explicit modeling [52] is needed to correctly align TFBSs of this nature.

Implementation

We have implemented a user-friendly webtool named LASAGNA-Search, which is freelyavailable at http://biogrid.engr.uconn.edu/lasagna_search/. Usefulfeatures include automatic promoter retrieval, visualization of hits locally andat the UCSC Genome Browser, and automatic gene regulatory network constructionbased on significant hits. LASAGNA-Search adopts the LASAGNA-PSSM method andcurrently stores PSSM $_{K_{s}}$ models (PSSM models for short), where K_s is determined by CV experiments, for the 189 TRANSFAC TFssummarized in Table 1 as well as 133 TFs from ORegAnno(08Nov10 dump). In Additional file 5, we list eachmodel with its counterpart for the same TF if one is found in matrices inTRANSFAC Public. We do not evaluate models by IC because higher IC implieshigher specificity but not necessarily higher sensitivity. Comparison withmodels in other databases is beyond the scope of this study but will beinvestigated in the near future.

LASAGNA-Search estimates p-values of PSSM scores empirically because thePSSM model for a TF may score nucleotide pairs in addition to individualnucleotides. When K_s=0, a PSSM score is considered the sum of independent variables andhence the exact p-value can be efficiently computed [53]. Even with this independence assumption, the scores of nucleotidepairs at (1, 2) and (2, 3), for instance, are never independent. Hence, a PSSMscore cannot be seen as the sum of independent variables when nucleotide pairsare scored. The empirical PSSM score distribution of a TF is obtained fromscanning a random sequence simulated byf(u),u∈{A, C, G, T}, wheref(u) is estimated from all the TFBSs used to build thePSSM. LASAGNA-Search focuses on only PSSM scores in the upper 5% and hencescores in the lower 95% are given a p-value of 0.05+. Currently, thesmallest nonzero p-value is 2.5×10⁻⁵ and 0 meansany number less than 2.5×10⁻⁵.

As a case study, we scanned the promoter region of human gene CCL2 (NCBI Gene ID6347), also known as MCP1. CCL2 was arbitrarily chosen by browsing the ORegAnnodatabase [28]. The promoter sequence (-950 to +50 relative to the transcriptionstart site) was automatically retrieved and scanned for binding sites of all 68human TFs with the p-value threshold set to 0.001. Figure 4 displays a partial view of the search results ordered byp-value. The only 3 true positive hits, AP-1, Sp1 and p50 (NFKB1),were found in the top-4 of the list. According to ORegAnno, CCL2 is a targetgene of AP-1, Sp1, NFKB1, STAT1 and GAS, where GAS likely refers to the gammaactivated site bound by STAT1. STAT1, however, is not one of the 68 TFs andhence all the TFs known to regulate CCL2 were recalled. The fact that AP-1, Sp1and p50 regulate CCL2 is also documented in TRANSFAC [27] (T00029, T00759 and T00593). The actual sites (R14639, R14638 andR14640), however, are not in the public release and were not used to build thePSSM models. We note that this case study is for illustration not evaluationpurposes.

Conclusions

We proposed LASAGNA, a novel alignment algorithm specifically designed for aligningvariable-length transcription factor binding sites. Cross-validation results on 189TFs and 4771 TFBSs indicated that LASAGNA significantly outperformed ClustalW2(p-value: 1.22×10⁻¹⁵) and MEME (p-value:3.55×10⁻¹⁵). This is because LASAGNA was specificallydesigned for aligning variable-length TFBSs. Based on the success of LASAGNA, wedeveloped LASAGNA-ChIP, which is capable of handling sequences produced by ChIP-chipand ChIP-seq experiments. While ClustalW2 is better suited for producingstructurally correct alignments, LASAGNA-ChIP, MEME and methods alike can be used toalign sequences produced by ChIP-chip or ChIP-seq experiments.

We compared LASAGNA-PSSM, the PSSM method dependent on LASAGNA, to SiTaR, analignment free TFBS search method. Cross-validation experiments were conducted on1751 TFBSs of 90 TFs for both methods. The results showed that, at fixed recallrates, LASAGNA-PSSM is significantly more precise than SiTaR (p-value:2.66×10⁻⁸). The recall-precision curve showed that ourmethod is constantly more precise at any recall rate or more sensitive at anyprecision.

We conclude that the LASAGNA algorithm is simple and effective in aligningvariable-length binding sites. It has been integrated into a user-friendly webtoolfor TFBS search called LASAGNA-Search. The tool currently stores precomputed PSSMmodels for 189 TFs and 133 TFs built from TFBSs in the TRANSFAC Public database(release 7.0) and the ORegAnno database (08Nov10 dump), respectively. In the future,more sources of experimentally validated TFBSs such as the PAZAR database will beincorporated into the webtool, making variable-length TFBSs more accessible toscientists in the field.

References

Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs inbiopolymers. 1994, Menlo Park: AAAI Press,
Google Scholar
Vilo J, Brazma A, Jonassen I, Ukkonen E, Robinson A: Mining for putative regulatory elements in the yeast genome using geneexpression data. Proceedings of the Eighth International Conference on Intelligent Systemsfor Molecular Biology. 2000, AAAI Press, 384-394.
Google Scholar
Barash Y, Bejerano G, Friedman N: A Simple hyper-geometric approach for discovering putative transcriptionfactor binding sites. 2001, London: Springer-Verlag,
Google Scholar
Buhler J, Tompa M: Finding motifs using random projections. 2001, New York: ACM
Book Google Scholar
Sinha S: Discriminative motifs. 2002, New York: ACM
Book Google Scholar
Takusagawa KT, Gifford DK: Negative information for motif discovery. 2004, Singapore: World Scientific
Google Scholar
Rajasekaran S, Balla S, Huang CH: Exact algorithms for planted motif problems. J Comput Biol. 2005, 12 (8): 1117-1128. 10.1089/cmb.2005.12.1117.
Article CAS PubMed Google Scholar
Balla S, Thapar V, Verma S, Luong T, Faghri T, Huang CH, Rajasekaran S, del Campo, Shinn JH, Mohler WA, Maciejewski MW, Gryk MR, Piccirillo B, Schiller SR, Schiller MR: Minimotif Miner: a tool for investigating protein function. Nat Methods. 2006, 3 (3): 175-177. 10.1038/nmeth856.
Article CAS PubMed Google Scholar
Li N, Tompa M: Analysis of computational approaches for motif discovery. Algorithms Mol Biol. 2006, 1: 8-10.1186/1748-7188-1-8.
Article PubMed Central PubMed Google Scholar
Zaslavsky E, Singh M: A combinatorial optimization approach for diverse motif findingapplications. Algorithms Mol Biol. 2006, 1: 13-10.1186/1748-7188-1-13.
Article PubMed Central PubMed Google Scholar
Yanover C, Singh M, Zaslavsky E: M are better than one: an ensemble-based motif finder and its application toregulatory element prediction. Bioinformatics. 2009, 25 (7): 868-874. 10.1093/bioinformatics/btp090.
Article PubMed Central CAS PubMed Google Scholar
Georgiev S, Boyle A, Jayasurya K, Ding X, Mukherjee S, Ohler U: Evidence-ranked motif identification. Genome Biol. 2010, 11 (2): R19-10.1186/gb-2010-11-2-r19.
Article PubMed Central PubMed Google Scholar
Osada R, Zaslavsky E, Singh M: Comparative analysis of methods for representing and searching fortranscription factor binding sites. Bioinformatics. 2004, 20 (18): 3516-3525. 10.1093/bioinformatics/bth438.
Article CAS PubMed Google Scholar
Chekmenev DS, Haid C, Kel AE: P-Match: transcription factor binding site search by combining patterns andweight matrices. Nucleic Acids Res. 2005, 33 (suppl 2): W432—W437-
PubMed Central PubMed Google Scholar
Hannenhalli S: Eukaryotic transcription factor binding sites-modeling and integrativesearch methods. Bioinformatics. 2008, 24 (11): 1325-1331. 10.1093/bioinformatics/btn198.
Article CAS PubMed Google Scholar
Salama RA, Stekel DJ: Inclusion of neighboring base interdependencies substantially improvesgenome-wide prokaryotic transcription factor binding site prediction. Nucleic Acids Res. 2010, 38 (12): e135-10.1093/nar/gkq274.
Article PubMed Central PubMed Google Scholar
Fazius E, Shelest V, Shelest E: SiTaR: a novel tool for transcription factor binding site prediction. Bioinformatics. 2011, 27: 2806-2811. 10.1093/bioinformatics/btr492.
Article CAS PubMed Google Scholar
Staden R: Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res. 1984, 12 (1Part2): 505-519. 10.1093/nar/12.1Part2.505.
Article PubMed Central CAS PubMed Google Scholar
Schug J: Using TESS to predict transcription factor binding sites in DNA sequence. Current Protocols in Bioinformatics. Edited by: Baxevanis AD, Baxevanis AD . 2003, J Wiley and Sons
Google Scholar
Kel A, Gößling E, Reuter I, Cheremushkin E, Kel-Margoulis O, Wingender E: MATCH™: a tool for searching transcription factor binding sites in DNAsequences. Nucleic Acids Res. 2003, 31 (13): 3576-3579. 10.1093/nar/gkg585.
Article PubMed Central CAS PubMed Google Scholar
Sandelin A, Wasserman WW, Lenhard B: ConSite: web-based prediction of regulatory elements using cross-speciescomparison. Nucleic Acids Res. 2004, 32 (suppl 2): W249—W252-
PubMed Central PubMed Google Scholar
Turatsinze JVV, Thomas-Chollier M, Defrance M, van Helden: Using RSAT to scan genome sequences for transcription factor binding sitesand cis-regulatory modules. Nat Protoc. 2008, 3 (10): 1578-1588. 10.1038/nprot.2008.97.
Article CAS PubMed Google Scholar
Zambelli F, Pesole G, Pavesi G: Pscan: finding over-represented transcription factor binding site motifs insequences from co-regulated or co-expressed genes. Nucleic Acids Res. 2009, 37 (suppl 2): W247—W252-
PubMed Central PubMed Google Scholar
Kiełbasa SM, Klein H, Roider HG, Vingron M, Blüthgen N: TransFind-predicting transcriptional regulators for gene sets. Nucleic Acids Res. 2010, 38 (suppl 2): W275—W280-
PubMed Central PubMed Google Scholar
Bryne JC, Valen E, Tang MHE, Marstrand T, Winther O, da Piedade I, Krogh A, Lenhard B, Sandelin A: JASPAR, the open access database of transcription factor-binding profiles:new content and tools in the 2008 update. Nucleic Acids Res. 2008, 36 (suppl 1): D102—D106-
PubMed Central PubMed Google Scholar
Gama-Castro S, Jiménez-Jacinto V, Peralta-Gil M, Santos-Zavaleta A, Peñaloza-Spinola MI, Contreras-Moreira B, Segura-Salazar J, Muñiz-Rascado L, Martínez-Flores I, Salgado H, Bonavides-Martínez C, Abreu-Goodger C, Rodríguez-Penagos C, Miranda-Ríos J, Morett E, Merino E, Huerta AM, Treviño-Quintanilla L, Collado-Vides J: RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12beyond transcription, active (experimental) annotated promoters andTextpresso navigation. Nucleic Acids Res. 2008, 36 (suppl 1): D120—D124-
PubMed Central PubMed Google Scholar
Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, Voss N, Stegmaier P, Lewicki-Potapov B, Saxel H, Kel AE, Wingender E: TRANSFAC®; and its module TRANSCompel®;: transcriptional generegulation in eukaryotes. Nucleic Acids Res. 2006, 34 (suppl 1): D108—D110-
PubMed Central PubMed Google Scholar
Griffith OL, Montgomery SB, Bernier B, Chu B, Kasaian K, Aerts S, Mahony S, Sleumer MC, Bilenky M, Haeussler M, Griffith M, Gallo SM, Giardine B, Hooghe B, Van Loo P, Blanco E, Ticoll A, Lithwick S, Portales-Casamar E, Donaldson IJ, Robertson G, Wadelius C, De Bleser P, Vlieghe D, Halfon MS, Wasserman W, Hardison R, Bergman CM, Jones SJ, Consortium TORA: ORegAnno: an open-access community-driven resource for regulatoryannotation. Nucleic Acids Res. 2008, 36 (suppl 1): D107—D113-
PubMed Central PubMed Google Scholar
Cartharius K, Frech K, Grote K, Klocke B, Haltmeier M, Klingenhoff A, Frisch M, Bayerlein M, Werner T: MatInspector and beyond: promoter analysis based on transcription factorbinding sites. Bioinformatics. 2005, 21 (13): 2933-2942. 10.1093/bioinformatics/bti473.
Article CAS PubMed Google Scholar
Kumar A, Cowen L: Recognition of beta-structural motifs using hidden Markov models trained withsimulated evolution. Bioinformatics. 2010, 26 (12): i287—i293-
Article PubMed Central PubMed Google Scholar
Thompson JD, Higgins DG, Gibson TJ, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment throughsequence weighting, position-specific gap penalties and weight matrixchoice. Nucleic Acids Res. 1994, 22 (22): 4673-4680. 10.1093/nar/22.22.4673.
Article PubMed Central CAS PubMed Google Scholar
Larkin M, Blackshields G, Brown N, Chenna R, McGettigan P, McWilliam H, Valentin F, Wallace I, Wilm A, Lopez R, Thompson J, Gibson T, Higgins D: Clustal W and Clustal X version 2.0. Bioinformatics. 2007, 23 (21): 2947-2948. 10.1093/bioinformatics/btm404.
Article CAS PubMed Google Scholar
Marinescu VD, Kohane IS, Riva A: The MAPPER database: a multi-genome catalog of putative transcription factorbinding sites. Nucleic Acids Res. 2005, 33 (suppl 1): D91—D97-
PubMed Central PubMed Google Scholar
Notredame C: Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol. 2007, 3 (8): e123-10.1371/journal.pcbi.0030123.
Article PubMed Central PubMed Google Scholar
Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WSS, Pavesi G, Pesole G, Régnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factorbinding sites. Nat Biotechnol. 2005, 23: 137-144. 10.1038/nbt1053.
Article CAS PubMed Google Scholar
Portales-Casamar E, Arenillas D, Lim J, Swanson MI, Jiang S, McCallum A, Kirov S, Wasserman WW: The PAZAR database of gene regulatory information coupled to the ORCA toolkitfor the study of regulatory sequences. Nucleic Acids Res. 2009, 37 (suppl 1): D54—D60-
PubMed Central PubMed Google Scholar
Schneider TD, Stormo GD, Gold L, Ehrenfeucht A: Information content of binding sites on nucleotide sequences. J Mol Biol. 1986, 188 (3): 415-431. 10.1016/0022-2836(86)90165-8.
Article CAS PubMed Google Scholar
Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-Wide Mapping of in Vivo Protein-DNA Interactions. Science. 2007, 316 (5830): 1497-1502. 10.1126/science.1141319.
Article CAS PubMed Google Scholar
Dreszer TR, Karolchik D, Zweig AS, Hinrichs AS, Raney BJ, Kuhn RM, Meyer LR, Wong M, Sloan CA, Rosenbloom KR, Roe G, Rhead B, Pohl A, Malladi VS, Li CH, Learned K, Kirkup V, Hsu F, Harte RA, Guruvadoo L, Goldman M, Giardine BM, Fujita PA, Diekhans M, Cline MS, Clawson H, Barber GP, Haussler D, James Kent W: The UCSC Genome Browser database: extensions and updates 2011. Nucleic Acids Res. 2012, 40 (D1): D918—D923-
Article PubMed Central PubMed Google Scholar
Zhu J, Zhang MQ: SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics. 1999, 15 (7): 607-611. 10.1093/bioinformatics/15.7.607.
Article CAS PubMed Google Scholar
Gilbert DG: euGenes: a eukaryote genome information system. Nucleic Acids Res. 2002, 30: 145-148. 10.1093/nar/30.1.145.
Article PubMed Central CAS PubMed Google Scholar
Fawcett T: An introduction to ROC analysis. Pattern Recogn Lett. 2006, 27: 861-874. 10.1016/j.patrec.2005.10.010.
Article Google Scholar
Wilcoxon F: Individual comparisons by ranking methods. Biometrics Bull. 1945, 1 (6): 80-83. 10.2307/3001968.
Article Google Scholar
Consortium TEP: An integrated encyclopedia of DNA elements in the human genome. Nature. 2012, 489: 57-74. 10.1038/nature11247.
Article Google Scholar
Thomas-Chollier M, Herrmann C, Defrance M, Sand O, Thieffry D, van Helden: RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res. 2012, 40 (4): e31-10.1093/nar/gkr1104.
Article PubMed Central CAS PubMed Google Scholar
Farnham PJ: Insights from genomic profiling of transcription factors. Nat Rev Genet. 2009, 10 (9): 605-616. 10.1038/nrg2636.
Article PubMed Central CAS PubMed Google Scholar
Gupta S, Stamatoyannopoulos J, Bailey T, Noble W: Quantifying similarity between motifs. Genome Biol. 2007, 8 (2): R24-10.1186/gb-2007-8-2-r24.
Article PubMed Central PubMed Google Scholar
Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: A sequence logo generator. Genome Res. 2004, 14 (6): 1188-1190. 10.1101/gr.849004.
Article PubMed Central CAS PubMed Google Scholar
Kozuka T, Sugita M, Shetzline S, Gewirtz AM, Nakata Y: c-Myb and GATA-3 cooperatively regulate IL-13 expression via conserved GATA-3response element and recruit mixed lineage leukemia (MLL) for histonemodification of the IL-13 Locus. J Immunol. 2011, 187 (11): 5974-5982. 10.4049/jimmunol.1100550.
Article PubMed Central CAS PubMed Google Scholar
Blumenthal SG, Aichele G, Wirth T, Czernilofsky AP, Nordheim A, Dittmer J: Regulation of the human Interleukin-5 promoter by Ets Transcription Factors:ETS1 AND ETS2, BUT NOT ELF-1, COOPERATE WITH GATA3 AND HTLV-I TAX1. J Biol Chem. 1999, 274 (18): 12910-12916. 10.1074/jbc.274.18.12910.
Article CAS PubMed Google Scholar
Wang J, Shannon MF, Young IG: A role for Ets1, synergizing with AP-1 and GATA-3 in the regulation of IL-5transcription in mouse Th2 lymphocytes. Int Immunol. 2006, 18 (2): 313-323.
Article CAS PubMed Google Scholar
Bi C, Leeder J, Vyhlidal C: A comparative study on computational two-block motif detection: algorithmsand applications. Mol Pharm. 2007, 5: 3-16.
Article PubMed Google Scholar
Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significantalignments of multiple sequences. Bioinformatics. 1999, 15 (7): 563-577. 10.1093/bioinformatics/15.7.563.
Article CAS PubMed Google Scholar

Download references

Acknowledgements

We are indebted to the two anonymous reviewers, whose comments greatly improvedthis paper. This work was supported in part by National Science Foundation[grant numbers CCF-0755373 and OCI-1156837].

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Connecticut, Fairfield Road, Storrs, CT, 06269, USA
Chih Lee & Chun-Hsi Huang

Authors

Chih Lee
View author publications
You can also search for this author in PubMed Google Scholar
Chun-Hsi Huang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Chun-Hsi Huang.

Additional information

Competing interests

Both authors declare that they have no competing interests.

Authors’ contributions

CH conceived the study. CL collected the data, carried out the experiments anddrafted the manuscript. CH guided the study and revised the manuscript. Both authorsread and approved the final manuscript.

Electronic supplementary material

Additional file 1: LASAGNA-ChIP flowchart.(PDF 301 KB)

Additional file 2: Distribution of K_s by species and conserved domain.(PDF 17 KB)

12859_2012_5933_MOESM3_ESM.xls

Additional file 3: Summary of 38 mouse ChIP-seq experiments. Each row shows thetrack name in the UCSC Genome Browser, cell type, target TF, number ofpeak sequences as well as the minimum, maximum, mean and standarddeviation of peak sequence length. (XLS 120 KB)

12859_2012_5933_MOESM4_ESM.pdf

Additional file 4: Motifs found by LASAGNA-ChIP and MEME. For each ChIP-seqexperiment, the sequence logos of motifs found by LASAGNA-ChIP and MEMEare shown. The matching motifs in the TRANSFAC Public and UniPROBEdatabases found by TOMTOM are listed below each sequence logo. The firstChIPed motif TF is highlighted in yellow if it is among the matchingmotifs. When the found motif does not resemble those of the ChIPed TF,the first cofactor of the ChIPed TF is highlighted in blue if it isamong the matching motifs. Other possibly correct matches arehighlighted in green. (PDF 491 KB)

12859_2012_5933_MOESM5_ESM.xls

Additional file 5: List of LASAGNA-built models based on TRANSFAC/ORegAnno TFBSs.Only models whose counterparts can be found in matrices in TRANSFACPublic are listed. The IC and number of sites are shown for eachmodel. (XLS 49 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative CommonsAttribution License (http://creativecommons.org/licenses/by/2.0), whichpermits unrestricted use, distribution, and reproduction in any medium, provided theoriginal work is properly cited.

Reprints and permissions

About this article

Cite this article

Lee, C., Huang, CH. LASAGNA: A novel algorithm for transcription factor binding site alignment. BMC Bioinformatics 14, 108 (2013). https://doi.org/10.1186/1471-2105-14-108

Download citation

Received: 24 July 2012
Accepted: 08 March 2013
Published: 24 March 2013
DOI: https://doi.org/10.1186/1471-2105-14-108

LASAGNA: A novel algorithm for transcription factor binding site alignment

Abstract

Background

Results

Conclusions

Background

Methods

The search module

The LASAGNA algorithm

LASAGNA for ChIP-seq data

Scoring a putative binding site

Results and discussion

Comparison of alignment algorithms

Data sets

Performance assessment and evaluation metrics

Comparison with ClustalW2

Comparison with MEME

Distribution of Ks

Comparison of TFBS search methods

Data sets

Performance assessment and evaluation metrics

Comparison with an alignment-free method

Application of LASAGNA-ChIP to ChIP-seq data

LASAGNA is simple and effective

Implementation

Conclusions

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ contributions

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us

Distribution of K_s