Skip to main content

Extracting transcription factor binding sites from unaligned gene sequences with statistical models

Abstract

Background

Transcription factor binding sites (TFBSs) are crucial in the regulation of gene transcription. Recently, chromatin immunoprecipitation followed by cDNA microarray hybridization (ChIP-chip array) has been used to identify potential regulatory sequences, but the procedure can only map the probable protein-DNA interaction loci within 1–2 kb resolution. To find out the exact binding motifs, it is necessary to build a computational method to examine the ChIP-chip array binding sequences and search for possible motifs representing the transcription factor binding sites.

Results

We developed a program to find out accurate motif sites from a set of unaligned DNA sequences in the yeast genome. Compared with MDscan, the prediction results suggest that, overall, our algorithm outperforms MDscan since the predicted motifs are more consistent with previously known specificities reported in the literature and have better prediction ranks. Our program also outperforms the constraint-less Cosmo program, especially in the elimination of false positives.

Conclusion

In this study, an improved sampling algorithm is proposed to incorporate the binomial probability model to build significant initial candidate motif sets. By investigating the statistical dependence between base positions in TFBSs, the method of dependency graphs and their expanded Bayesian networks is combined. The results show that our program satisfactorily extract transcription factor binding sites from unaligned gene sequences.

Background

Understanding transcription is central to understanding genetic regulatory mechanisms. The transcription of a gene is generally dependent on the presence of specific signals located at upstream regions of the core-promoter. These specific signals derive from their use as binding sites by transcription factors (TFs), and are therefore termed transcription factor binding sites (TFBSs). Recently, chromatin immunoprecipitation followed by cDNA microarray hybridization (ChIP-chip array) has been used to identify potential regulatory sequences, but the procedure can only map the probable protein-DNA interaction loci within 1–2 kilobases resolution [1]. To find out the exact binding motifs, it is necessary to build a computational method to examine the ChIP-chip array binding sequences and search for possible motifs representing the TFBSs (motif discovery).

There are many computational TFBS motif finding tools available [24]. The traditional approach for finding TFBSs is to collect and align a set of promoter sequences of co-regulated genes from either the literature or systematic experiments. Numerous computational tools, such as CONSENSUS [5], EM [6], MEME [7] and the Gibbs sampler [8], have utilized the approach to identify short DNA sequence motifs which are statistically over-represented in the promoter sequences.

Other than the alignment-based motif finding algorithms in above, many approaches have tried to extend to the use of evolutionary conservation information such as phylogenetic footprinting or the detection of combinations of binding sites (termed as cis-regulatory modules; CRMs) [2, 3]. Phylogenetic footprinting methods [911] is an approach that seeks to identify conserved regulatory elements by comparing genomic sequences between related species. However, due to the statistical nature of the approach, e.g., a small amount of closely related species, not all transcription binding sites can be found by using phylogenetic footprinting. Hence, some algorithms have emerged to combine the alignment-based motif prediction with phylogenetic footprinting such as PhyloGibbs [12] and MY sampler [13]. On the other hand, by the detection of CRMs due to the cooperative interactions between TFs, algorithms like those in [1416] can produce predictions of substantially better specificity than those of isolated sites.

Recently, more effective motif finders, e.g., MDscan [1], ANN-Spec [17], DMOTIFS [18], DME [19] and Cosmo [20], have taken the advantage of a background set, serving as a negative control. The goal of these discriminant motif finders is to search only for motifs that are most discriminating, that is, only those enriched in the foreground set relative to the background set [2]. Although these motif finders have improved the performance of TFBS prediction, it is still a trouble to have a satisfactory solution. How to find out accurate binding motifs may require much attention in the computational biology community. In this study, an improved sampling algorithm is proposed to incorporate the binomial probability model to build significant initial motif sets. By investigating the statistical dependence between base positions in TFBSs, it appears feasible to use statistical models to formulate the structural dependence of a motif in the identification of TFBSs. In light of this observation, the method of dependency graphs and their expanded Bayesian networks [21] is combined and prediction results show that our algorithm is able to find out motifs more consistent with previously known evidence.

Methods

Let TF be one of the transcription factors to be investigated. The binding dataset of the transcription factor TF, denoted as B TF , consists of the sequences with low binding p-value (< 0.001) to the TF in the ChIP-chip array data [22]. A sliding window of size w is used to extract segments of length w when sliding through each of the sequences in B TF .

Let S TF be the collection of all extracted segments from B TF , M the number of sequences in the binding dataset B TF , L i the length of the i th sequence in the binding dataset B TF , and T TF the total number of segments in S TF . Then

T T F = i = 1 M ( L i w + 1 ) . MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemivaq1aaSbaaSqaaiabdsfaujabdAeagbqabaGccqGH9aqpdaaeWbqaaiabcIcaOiabdYeamnaaBaaaleaacqWGPbqAaeqaaOGaeyOeI0Iaem4DaCNaey4kaSIaeGymaeJaeiykaKcaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGnbqta0GaeyyeIuoakiabc6caUaaa@4114@

To discover the binding motifs of the transcription factor TF, a number of initial candidate motif sets for TF is subsequently built from the collection S TF of extracted segments. Note that the contents of segments, called patterns, in S TF may not be distinct.

Most of early motif finding algorithms, such as Gibbs sampler [8] and MEME [23], have a weakness, where initial candidate motif sets are built by randomly extracting segments from sequences in the binding dataset B TF (i.e. randomly selecting segments from the set S TF ). To improve the deficiency, the binomial probability distribution model is firstly utilized in the establishment of a number of initial candidate motif sets in our algorithm.

Then in the process of iterative sampling in our algorithm to expand and/or trim each of the initial candidate motif sets, the method of dependency graphs and their expanded Bayesian networks [21] is used to develop a statistical model for the background motif set identified as the union S = TF S TF of segments extracted from all transcription factor binding datasets.

The basic procedure to find the binding motifs of the transcription factor TF is as follows:

  1. 1.

    Build N initial candidate motif sets.

  2. (a)

    Take N distinct patterns from the set S TF with the most highest significance scores as the candidates by the binomial distribution model (see the Binomial probability distribution model subsection).

  3. (b)

    Then for each of the N significant binding site candidates for the transcription factor TF, in view of evolution, collect all segments in S TF whose patterns have no more than d Hamming distance matching to the candidate pattern to form an initial candidate motif set.

At this stage, N initial candidate motif sets for the transcription factor TF are built.

  1. 2.

    Iteratively sample through the binding dateset B TF to expand and/or trim each of the N initial candidate motif sets so that their approximate maximum a posteriori (AMAP) scores [1, 24] can keep increasing until the N candidate motif sets are invariant in K consecutive iterations (see the Iterative sampling subsection).

  2. (a)

    In the calculation of AMAP scores in this stage, the background model for the background motif set S = TF S TF is established under the method of dependency graphs and their expanded Bayesian networks (see the Method of dependency graphs and their expanded Bayesian networks subsection).

  3. 3.

    Refine each of the N candidate motif sets by re-examining all the segments already included in the motif set. A segment is removed from the motif set if doing so increases the AMAP score.

A simple flowchart for our algorithm is shown in Figure 1. The following subsections will expatiate on each stage of our algorithm. As an illustration of the dynamics of the PWM and the rank of different candidate motif sets at different stages of our algorithm, a summary of the prediction process for the motif of the transcription factor CBF1 is given in Figure 2.

Figure 1
figure 1

A flowchart of our algorithm.

Figure 2
figure 2

An illustration of the dynamics of the PWM and the rank of five candidate motif sets at different stages of our algorithm in the motif prediction of the transcription factor CBF1.

Initial motif sets building

Our method begins by enumerating those patterns in S TF that appear most often in the binding dataset B TF than in others. What we want to do first is to calculate the appearance probability of a pattern in S TF , which is the probability that the pattern appears no less than n times in the binding dataset B TF . If a pattern b appears more often than other patterns in S TF and its occurrence probability in a generic intergenic region is comparatively low, the calculated significance score of b would be relatively high. We will take patterns with the most highest significance scores as the candidates to build a number of initial candidate motif sets.

Binomial probability distribution model

The probability to observe exactly j occurrences of pattern b in the collection S TF of segments extracted from the binding dataset B TF is estimated by the binomial distribution

P T F ( o c c ( b ) = j ) = ( T T F j ) × ( f ( b ) ) j × ( 1 f ( b ) ) T T F j , MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiuaa1aaSbaaSqaaiabdsfaujabdAeagbqabaGccqGGOaakcqWGVbWBcqWGJbWycqWGJbWycqGGOaakcqWGIbGycqGGPaqkcqGH9aqpcqWGQbGAcqGGPaqkcqGH9aqpdaqadaqaauaabeqaceaaaeaacqWGubavdaWgaaWcbaGaemivaqLaemOrayeabeaaaOqaaiabdQgaQbaaaiaawIcacaGLPaaacqGHxdaTcqGGOaakcqWGMbGzcqGGOaakcqWGIbGycqGGPaqkcqGGPaqkdaahaaWcbeqaaiabdQgaQbaakiabgEna0kabcIcaOiabigdaXiabgkHiTiabdAgaMjabcIcaOiabdkgaIjabcMcaPiabcMcaPmaaCaaaleqabaGaemivaq1aaSbaaWqaaiabdsfaujabdAeagbqabaWccqGHsislcqWGQbGAaaGccqGGSaalaaa@5D42@

where occ(b) is the occurrence times of pattern b in S TF and f (b) is the probability that pattern b occurs in the intergenic region and is estimated as the relative frequency of pattern b in the union S = TF S TF of segments extracted from all transcription factor binding datasets. The probability to observe n or more occurrences of the pattern b in S TF is

P T F ( o c c ( b ) n ) = 1 j = 0 n 1 P T F ( o c c ( b ) = j ) . MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiuaa1aaSbaaSqaaiabdsfaujabdAeagbqabaGccqGGOaakcqWGVbWBcqWGJbWycqWGJbWycqGGOaakcqWGIbGycqGGPaqkcqGHLjYScqWGUbGBcqGGPaqkcqGH9aqpcqaIXaqmcqGHsisldaaeWbqaaiabdcfaqnaaBaaaleaacqWGubavcqWGgbGraeqaaOGaeiikaGIaem4Ba8Maem4yamMaem4yamMaeiikaGIaemOyaiMaeiykaKIaeyypa0JaemOAaOMaeiykaKcaleaacqWGQbGAcqGH9aqpcqaIWaamaeaacqWGUbGBcqGHsislcqaIXaqma0GaeyyeIuoakiabc6caUaaa@570D@
(1)

We define the significance score sig TF (b) of a pattern b to TF as

sig TF (b) = -log10(P TF (occ(b) ≥ n)).

The less probable pattern b in S appears more than n times in S TF , the more probable will it be a binding site candidate for the transcription factor TF. We will take N distinct patterns with the most highest significance scores as the candidates.

For each of the N significant binding site candidates for the transcription factor TF, in view of evolution, collect all segments in S TF whose patterns have no more than d Hamming distance matching to the candidate pattern to form an initial candidate motif set. Thus N initial candidate motif sets for the transcription factor TF are built at the end of this stage. As an example, the PWM and the rank of five initial candidate motif sets for the motif prediction of the transcription factor CBF1 are shown in Figure 2.

Iterative sampling

In this stage, a sampling method is used to expand and/or trim each of the N initial candidate motif sets M1, M2,.., M N . For our purpose, a false motif set MN+1is created by randomly selecting e0 (e0 is equal to the maximum size of the N initial candidate motif sets) segments from the collection S TF such that M i MN+1= , for all i = 1, 2,..., N. In addition, let the collection S = TF S TF of segments extracted from all transcription factor binding datasets represent the intergenic background and here be denoted as M BG .

Approximate maximum a posteriori (AMAP) measure

The score a m a p M i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyyaeMaemyBa0MaemyyaeMaemiCaa3aaSbaaSqaaiabd2eannaaBaaameaacqWGPbqAaeqaaaWcbeaaaaa@3419@ of the approximate maximum a posteriori (AMAP) measure of the candidate motif set M i is defined as [1, 24]

a m a p M i = 1 w { s = 0 w 1 j A , T , G , C p s , j log ( p s , j ) 1 n i m M i log ( P ( m | M B G ) ) } , MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyyaeMaemyBa0MaemyyaeMaemiCaa3aaSbaaSqaaiabd2eannaaBaaameaacqWGPbqAaeqaaaWcbeaakiabg2da9KqbaoaalaaabaGaeGymaedabaGaem4DaChaaOWaaiWaaeaadaaeWbqaamaaqafabaGaemiCaa3aaSbaaSqaaiabdohaZjabcYcaSiabdQgaQbqabaGccyGGSbaBcqGGVbWBcqGGNbWzcqGGOaakcqWGWbaCdaWgaaWcbaGaem4CamNaeiilaWIaemOAaOgabeaakiabcMcaPaWcbaGaemOAaOMaeyicI4SaemyqaeKaeiilaWIaemivaqLaeiilaWIaem4raCKaeiilaWIaem4qameabeqdcqGHris5aaWcbaGaem4CamNaeyypa0JaeGimaadabaGaem4DaCNaeyOeI0IaeGymaedaniabggHiLdGccqGHsisldaWcaaqaaiabigdaXaqaaiabd6gaUnaaBaaaleaacqWGPbqAaeqaaaaakmaaqafabaGagiiBaWMaei4Ba8Maei4zaCMaeiikaGIaemiuaaLaeiikaGIaemyBa0MaeiiFaWNaemyta00aaSbaaSqaaiabdkeacjabdEeahbqabaGccqGGPaqkcqGGPaqkaSqaaiabd2gaTjabgIGiolabd2eannaaBaaameaacqWGPbqAaeqaaaWcbeqdcqGHris5aaGccaGL7bGaayzFaaGaeiilaWcaaa@7CBE@

where ps, jis the frequency of nucleotide j at base position s in the candidate motif set M i (which can be retrieved from the position specific scoring matrix (PSSM) of M i ), n i is the number of segments in M i , and P(m|M BG ) is the probability of the pattern of segment m in the motif set M i under an expanded Bayesian network (EBN) model [21] developed from the background motif set M BG (EBN model will be discussed shortly).

The first part of the AMAP score is a negative entropy, which is higher if there are more similar patterns in the candidate motif set M i . A motif set M i with all identical patterns has the maximum negative entropy 0, whereas equal nucleotide frequencies at every position in the PSSM of M i has the minimum negative entropy. And a segment m in the candidate motif set M i which has a pattern much different from the background motif model built from M BG would have lower appearance probability P (m|M BG ) and hence increases the score a m a p M i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyyaeMaemyBa0MaemyyaeMaemiCaa3aaSbaaSqaaiabd2eannaaBaaameaacqWGPbqAaeqaaaWcbeaaaaa@3419@ of the AMAP measure of M i .

Sampling strategy

In each iteration, there are two steps for the sampler, the S-step and the M-step.

In the S-step, the sampler samples a site by randomly selecting a sequence from B TF and then randomly picking up a site in the selected sequence to extract a segment m s of length w. For 1 ≤ iN, if the sampled segment m s appears in M i , segment m s will be removed from M i if the AMAP score a m a p M i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyyaeMaemyBa0MaemyyaeMaemiCaa3aaSbaaSqaaiabd2eannaaBaaameaacqWGPbqAaeqaaaWcbeaaaaa@3419@ of the candidate motif set M i increases after its removal; otherwise, segment m s will be kept in M i . Note that the PSSM of the motif model M i should be retrained if the sampled segment m s is removed from M i .

Which one of the N + 1 motif sets would be the best motif set for the sampled segment m s will depend on the appendant score a p p M i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyyaeMaemiCaaNaemiCaa3aaSbaaSqaaiabd2eannaaBaaameaacqWGPbqAaeqaaaWcbeaaaaa@32D4@ that the segment m s is derived from M i [24, 25]

a p p M i = log ( P ( n i ) 1 P ( n i ) P ( m s | M i ) P ( m s | M B G ) ) , 1 i N + 1 , MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyyaeMaemiCaaNaemiCaa3aaSbaaSqaaiabd2eannaaBaaameaacqWGPbqAaeqaaaWcbeaakiabg2da9iGbcYgaSjabc+gaVjabcEgaNnaabmaajuaGbaWaaSaaaeaacqWGqbaucqGGOaakcqWGUbGBdaWgaaqaaiabdMgaPbqabaGaeiykaKcabaGaeGymaeJaeyOeI0IaemiuaaLaeiikaGIaemOBa42aaSbaaeaacqWGPbqAaeqaaiabcMcaPaaadaWcaaqaaiabdcfaqjabcIcaOiabd2gaTnaaBaaabaGaem4CamhabeaacqGG8baFcqWGnbqtdaWgaaqaaiabdMgaPbqabaGaeiykaKcabaGaemiuaaLaeiikaGIaemyBa02aaSbaaeaacqWGZbWCaeqaaiabcYha8jabd2eannaaBaaabaGaemOqaiKaem4raCeabeaacqGGPaqkaaaakiaawIcacaGLPaaacqGGSaalcqaIXaqmcqGHKjYOcqWGPbqAcqGHKjYOcqWGobGtcqGHRaWkcqaIXaqmcqGGSaalaaa@66F5@
(2)

where n i is the size of current motif set M i , P(n i ) equals n i T T F MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqcfa4aaSaaaeaacqWGUbGBdaWgaaqaaiabdMgaPbqabaaabaGaemivaq1aaSbaaeaacqWGubavcqWGgbGraeqaaaaaaaa@32EC@ , P(m s |M i ) and P(m s |M BG ) are the probabilities of the content of the sampled segment m s under the PSSM model developed from the current motif set M i and under an EBN model developed from the background M BG , respectively. The sampled segment m s will be considered to append into the motif set M i with the highest appendant score a p p M i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyyaeMaemiCaaNaemiCaa3aaSbaaSqaaiabd2eannaaBaaameaacqWGPbqAaeqaaaWcbeaaaaa@32D4@ . If a p p M N + 1 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyyaeMaemiCaaNaemiCaa3aaSbaaSqaaiabd2eannaaBaaameaacqWGobGtcqGHRaWkcqaIXaqmaeqaaaWcbeaaaaa@3470@ is the highest score, then the sampled segment m s is appended into the false motif set MN+1unless m s is already there and the current iteration stops here. If for some i, 1 ≤ iN, a p p M i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyyaeMaemiCaaNaemiCaa3aaSbaaSqaaiabd2eannaaBaaameaacqWGPbqAaeqaaaWcbeaaaaa@32D4@ is the highest score, the sampled segment m s will be further checked in the M-step to see if we really want to append m s into M i unless we have processed m s for M i at the beginning of this S-step as in above and the current iteration stops here.

In the M-step, the sampler has to decide whether the newly sampled segment m s should be appended into the candidate motif set M i or not. The AMAP measure again will be used to evaluate our decision. The sampled segment m s is appended into the candidate motif set M i if and only if the score a m a p M i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyyaeMaemyBa0MaemyyaeMaemiCaa3aaSbaaSqaaiabd2eannaaBaaameaacqWGPbqAaeqaaaWcbeaaaaa@3419@ of the motif model M i is increased once the sampled segment m s is appended to M i . Note that the PSSM of the motif model M i should be retrained after the sampled segment b s is appended to M i . Now the M-step is done and the current iteration stops here.

The sampler will iteratively sample through the binding dataset B TF to expand and/or trim the N candidate motif sets M1, M2,.., M N so that their AMAP scores a m a p M i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyyaeMaemyBa0MaemyyaeMaemiCaa3aaSbaaSqaaiabd2eannaaBaaameaacqWGPbqAaeqaaaWcbeaaaaa@3419@ will keep increasing. The N candidate motif sets will tend to be invariant after a (larger) number of iterations. The stopping criterion of the sampling process is that all the N candidate motif sets are invariant in K consecutive iterations. The parameter K is usually set to be 1% of the size of S TF .

Alternative sampling strategy

There is an alternative sampling strategy as follows.

In the S-step, the new sampler also randomly samples a site from a sequence in B TF to extract a segment m s of length w. For 1 ≤ iN, if the pattern of the sampled segment m s appears in M i , all the segments in M i whose pattern is the same as that of m s will be removed if the AMAP score a m a p M i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyyaeMaemyBa0MaemyyaeMaemiCaa3aaSbaaSqaaiabd2eannaaBaaameaacqWGPbqAaeqaaaWcbeaaaaa@3419@ of the motif set M i increases after their removal. Otherwise, these segments will be kept in M i .

Also in the S-step, if a p p M N + 1 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyyaeMaemiCaaNaemiCaa3aaSbaaSqaaiabd2eannaaBaaameaacqWGobGtcqGHRaWkcqaIXaqmaeqaaaWcbeaaaaa@3470@ is the highest among all a p p M i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyyaeMaemiCaaNaemiCaa3aaSbaaSqaaiabd2eannaaBaaameaacqWGPbqAaeqaaaWcbeaaaaa@32D4@ , 1 ≤ iN + 1, then all segments in the set S TF having the same pattern as that of the sampled segment m s will be appended into the false motif set MN+1unless these segments are already there and the current iteration stops here. If a p p M i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyyaeMaemiCaaNaemiCaa3aaSbaaSqaaiabd2eannaaBaaameaacqWGPbqAaeqaaaWcbeaaaaa@32D4@ is the highest for some i, 1 ≤ iN, the sampled segment m s will be further checked in the M-step to see if we really want to append those segments in the set S TF having the same pattern as that of the sampled segment m s into M i unless we have already processed those segments for M i at the beginning of this S-step as in above and the current iteration stops here.

In the M-step, all the segments in the set S TF having the same pattern as that of the sampled segment m s are decided to append to the candidate motif set M i if and only if the AMAP score a m a p M i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyyaeMaemyBa0MaemyyaeMaemiCaa3aaSbaaSqaaiabd2eannaaBaaameaacqWGPbqAaeqaaaWcbeaaaaa@3419@ of M i increases after these segaments are appended into M i .

Method of dependency graphs and their expanded Bayesian networks

Considering the binding mechanism of transcription factors to specific DNA sites (motifs), there must be distinctive features for the specific motif regions from other intergenic regions which represent the background DNA sequence. Hence, it is conceivable that we can use a statistical model to capture the feature of a specific DAN site (motif) or a generic DNA intergenic region (background). Since the size of a candidate motif set M i is often small, a PSSM model is commonly used for M i instead of any other more sophisticated statistical model. However, the size of the background motif set M BG is usually large enough to be equipped with a more sophisticated one.

As reported in [21], a dependency graph model is used to fully capture the intrinsic interdependency between base positions in a motif or region. The establishment of dependency between two positions is based on a χ2-test from known sample data. An edge is established between two nodes (a node represents a base position) in the graph if the two corresponding base positions of the motif or region are dependent. After all dependent edges have being established completely, a dependency graph for the motif or region is constructed. An example of a dependency graph with 7 nodes is shown in Figure 3.

Figure 3
figure 3

An example of dependency graph.

As reported in [21], although the dependency graph can fully capture the intrinsic interdependency between base positions in a motif or region, it is difficult, if not impossible, to perform statistical inference based on the dependency graph. To resolve the dilemma, the dependency graph is expanded to form a Bayesian network (which is a directed acyclic graph that facilitates statistical reasoning) by allowing a base position in the dependency graph to appear more than once in the Bayesian network as nominally distinct nodes. Figure 4 shows an example of an expanded Bayesian network of the dependency graph in Figure 3. For the detailed procedure of constructing an expanded Bayesian network (EBN) from a dependency graph, please see [21]. In this paper, we use EBNs to model the background motif set M BG .

Figure 4
figure 4

An example of expanded Bayesian network.

Continued with the same example of the motif prediction of the transcription factor CBF1, the PWM and the rank of the five candidate motif sets at the end of the iterative sampling stage are also shown in Figure 2, together with the final results at the end of the refinement stage.

Results and discussion

Data

In order to search for the transcription factor binding sites that regulate gene expressions, we collected binding promotor sequences from the cDNA microarray hybridization (ChIp-chip array) of yeast genome [22]. Each of the binding sequences may contain some unknown motifs that are implanted at unknown positions. These data represent the binding affinity of a target transcription factor to the promoter region of a gene in vivo. The experiment protocol assigns a binding p-value to each binding promoter sequence of the corresponding transcription factor. A sequence with binding p-value less than 0.001 is considered to be bound by the corresponding transcription factor. The threshold of 0.001 is set up to reduce the false positive identification in yeast genome-wide screening.

We collected the ChIp-chip array sequence data from the "Motif discovery results – Discovered motifs, version 24" at [26]. For a transcription factor TF to be investigated, we collected all sequences with binding p-value less than the threshold 0.001 to TF into the binding dataset B TF . There are 65 binding datasets B TF being able to be collected from Harbison's website.

Accuracy measurement and comparison

To evaluate the performance of our program, we collected known specificities from many famous websites, such as YPD, SCPD, Transfac and from the literature with experimental evidence [27] to compare with the discovered specificities predicted by our program.

Among the 65 binding datasets B TF collected from Harbison's website, we chose 36 transcription factor binding datasets which have known specificities with experimental evidence to evaluate the performance of our program. The results of our program for the 36 transcription factor binding datasets are listed in Figure 5. It is deserved to be mentioned that the specificity reported for transcription factor PHO2 in Harbrison et al's website is "GTGCGsyGCG", while the predicted result of our program is "ATTATC". In this case, the newly found motif by our program is more consistent with the results reported by Barbaric et al [28] that PHO2 binds to an AT-rich region than the specificity reported in Harbrison et al's website.

Figure 5
figure 5

Predicted results of our program compared with known evidence. The letter symbols used in the 'Known specificity' column have the following mapping: aA: a tT: t gG: g cC: c wW: at rR: ag mM: ac kK: tg yY: tc sS: gc dD: atg hH: atc vV: agc bB: tgc nN: atgc

In this study, we compared our program with two online programs, MDscan [29] and Cosmo [30]. MDscan is a famous program that can be used to examine the ChIP-array selected sequences and search for DNA sequence motifs representing the protein-DNA interaction sites. It takes the advantage of combining two widely adopted motif search strategies, word enumeration and position-specific weight matrix updating, and incorporates the ChIP enrichment information to accelerate the search and enhance its success rate. The comparison of MDscan with our program is shown in Table 1. Also reported in Table 1 is the performance of our algorithm when the the PSSM model [8] instead of the EBN model [21] is used to model the background motif set M BG in the calculation of the AMAP scores of the N candidate motif sets and the appendant scores of the N + 1 motif sets. In Table 1, for each transcription factor, the number in each 'Rank' column indicates the rank of the predicted motif which is most consistent with the known evidence from the top ten predicted candidate motifs.

Table 1 Comparison of MDscan and our program.

As shown in Table 1, our approach with EBN background model outperforms the other two methods. Our approach with EBN background model gives 30 out of the 36 most predicted motifs for the corresponding 36 transcription factors with the 1st rank, while MDscan and our approach with PSSM background model give only 20 out of 36 and 15 out of 36 most predicted motifs with the 1st rank, respectively. Moreover, MDscan fails in discovering a motif for three transcription factor binding datasets, while our approach in this study is still able to predict a motif consistent with the known evidence.

Cosmo (constrained search for motifs) is a general purpose algorithm for conserved motif detection that allows the search to be supervised by specifying a set of constraints that the PWM of the unknown motif must satisfy. Such constraints may be formulated derived from prior biological knowledge about the structure of the transcription factor, such as the length of the motif intervals. Although Cosmo is based on the same two-component multinomial mixture model used in MEME, it employs the likelihood principle instead of the E-value criterion in MEME. In addition, three model types (OOPS, ZOOPS, or TCM) can data-adaptively be selected in Cosmo to achieve better performance. Since there is no prior knowledge used in our program, we compared it to the constraint-less version of the Cosmo program. On the other hand, since the Cosmo program reports only one motif PWM for a dataset, instead of a list of ranked candidate motif PWMs as in MDscan, we adopted only the rank 1 results of our program in this comparison. To evaluate the performance of both programs, we used the statistics proposed by Tompa et al. [4]. For a (computational) tool at the site level, the performance statistics on a dataset are defined as follows:

Sensitivity Positive predictive value Average site performance : : : S n = T P / ( T P + F N ) P P V = T P / ( T P + F P ) A S P = ( S n + P P V ) / 2 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeqabeWaaaqaauaabiqadeaaaeaacqqGtbWucqqGLbqzcqqGUbGBcqqGZbWCcqqGPbqAcqqG0baDcqqGPbqAcqqG2bGDcqqGPbqAcqqG0baDcqqG5bqEaeaacqqGqbaucqqGVbWBcqqGZbWCcqqGPbqAcqqG0baDcqqGPbqAcqqG2bGDcqqGLbqzcqqGGaaicqqGWbaCcqqGYbGCcqqGLbqzcqqGKbazcqqGPbqAcqqGJbWycqqG0baDcqqGPbqAcqqG2bGDcqqGLbqzcqqGGaaicqqG2bGDcqqGHbqycqqGSbaBcqqG1bqDcqqGLbqzaeaacqqGbbqqcqqG2bGDcqqGLbqzcqqGYbGCcqqGHbqycqqGNbWzcqqGLbqzcqqGGaaicqqGZbWCcqqGPbqAcqqG0baDcqqGLbqzcqqGGaaicqqGWbaCcqqGLbqzcqqGYbGCcqqGMbGzcqqGVbWBcqqGYbGCcqqGTbqBcqqGHbqycqqGUbGBcqqGJbWycqqGLbqzaaaabaqbaeqabmqaaaqaaiabcQda6aqaaiabcQda6aqaaiabcQda6aaaaeaafaqabeWabaaabaGaem4uamLaemOBa4Maeyypa0JaemivaqLaemiuaaLaei4la8IaeiikaGIaemivaqLaemiuaaLaey4kaSIaemOrayKaemOta4KaeiykaKcabaGaemiuaaLaemiuaaLaemOvayLaeyypa0JaemivaqLaemiuaaLaei4la8IaeiikaGIaemivaqLaemiuaaLaey4kaSIaemOrayKaemiuaaLaeiykaKcabaGaemyqaeKaem4uamLaemiuaaLaeyypa0JaeiikaGIaem4uamLaemOBa4Maey4kaSIaemiuaaLaemiuaaLaemOvayLaeiykaKIaei4la8IaeGOmaidaaaaaaaa@AA7B@

where TP is the number of known sites overlapped by predicted sites, FN is the number of known sites not overlapped by predicted sites, and FP is the number of predicted sites not overlapped by known sites. To summarize the performance of a given tool over a collection C MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8NaXpeaaa@374D@ of datasets, we compute the "combined" statistics as though C were one large dataset by adding TP, FP and FN respectively over the datasets in C MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8NaXpeaaa@374D@ . Then the combined statistics of our program are Sn = 0.6698, PPV = 0.8206, and ASP = 0.7452, while those of Cosmo are Sn = 0.6573, PPV = 0.5134 and ASP = 0.5854. For the detailed Cosmo prediction results and the comparison of the two programs, please see Figure S1 (see Additional file 1). The comparison shows that our program can offer better performance than Cosmo, especially in the elimination of false positives.

The parameters used in our program include the sliding window size w used to extract segments from binding datasets B TF to form S TF , the Hamming distance d used to collect segments from S TF to establish initial candidate motif sets, the number of most dependent edges used to form a dependency graph for the background motif model and the number of parents used in the construction of an expanded Bayesian network from the dependency graph [21]. The parameters used in our program to give the best predicted motifs for each of the 36 transcription factors are listed in Table S1 (see Additional file 2). Comparing the performance of the two sampling strategies discussed in the Method section, as shown in Figure S2 (see Additional file 3), we found that the alternative sampler is faster and has almost identical best predicted motifs with those by the primary sampler, except that transcription factors GCN4, HAP4 and PHO4 have the best predicted motifs one nucleotide position shift from those by the primary sampler. In addition, the alternative sampler is slightly better than the primary sampler in the sense that the best predicted motif for the transcription factor DIG1 promotes its rank from the 2nd place by the primary sampler to the 1st place by the alternative sampler.

Conclusion

In this study, we employed the binomial probability model to establish a number of initial candidate motif sets, and used the method of dependence graphs and their expanded Bayesian networks to model the background motif set as a control to predict TFBSs (motifs) from a set of unaligned DNA sequences. The prediction results suggest that, overall, our algorithm outperforms MDscan since the predicted motifs are more consistent with previously known specificities reported in the literature and have better prediction ranks. And when compared with the constraint-less Cosmo program, our algorithm has a slightly higher combined sensitivity Sn, a much higher positive predictive value PPV and a higher average site performance ASP. However, the performance of our algorithm is not much better if the length of possible binding sites are too long (more than 12 bps). Further research is needed to discover long motifs.

Furthermore, variable spacing within binding sites is legitimate for some transcription factors while this study focuses on ungapped motif discovery. Programs such as BIPAD [31] and spaced dyad [32] have investigated into such a bipartitie sequence element discovery problem. Therefore another direction for our future research is to investigate into gapped motifs.

References

  1. Liu XS, Brutlag DL, Liu JS: An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitaion microarray experiments. Nat Biotechnol. 2002, 20: 835-839.

    Article  CAS  PubMed  Google Scholar 

  2. Zhang MQ: Computational analyses of eukaryotic promoters. BMC Bioinformatics. 2007, 8 (Suppl 6):

    Google Scholar 

  3. Wasserman WW, Sandelin A: Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet. 2004, 5: 276-287. 10.1038/nrg1315.

    Article  CAS  PubMed  Google Scholar 

  4. Tompa M, Li N, Bailey TL, Church GM, Moor BD, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Regnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005, 23: 137-144. 10.1038/nbt1053.

    Article  CAS  PubMed  Google Scholar 

  5. Hertz GZ, George W, Hartzell I, Stormo GD: Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput Appl Biosci. 1990, 6: 81-92.

    CAS  PubMed  Google Scholar 

  6. Lawrence CE, Reilly AA: An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins. 1990, 7: 41-51. 10.1002/prot.340070105.

    Article  CAS  PubMed  Google Scholar 

  7. Bailey TL, Elkan C: The value of prior knowledge in discovering motifs with MEME. Proceedings of the Third International Comference on Intelligent Systems for Molecular Biology. 1995, Menlo Park, CA: AAAI Press, 21-29.

    Google Scholar 

  8. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science. 1993, 262: 208-214. 10.1126/science.8211139.

    Article  CAS  PubMed  Google Scholar 

  9. Tagle DA, Koop BF, Goodman M, Slightom JL, Hess DL, Jones RT: Embryonic ε and γ globin genes of a prosimian primate (Galago crassicaudatus). Nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints. J Mol Biol. 1988, 203: 439-455. 10.1016/0022-2836(88)90011-3.

    Article  CAS  PubMed  Google Scholar 

  10. Blanchette M, Schwikowski B, Tompa M: Algorithms for phylogenetic footprinting. J Comput Biol. 2002, 9: 211-223. 10.1089/10665270252935421.

    Article  CAS  PubMed  Google Scholar 

  11. Lenhard B, Sandelin A, Mendoza L, Engstrom1 P, Jareborg N, Wasserman WW: Identification of conserved regulatory elements by comparative genome analysis. J Biol. 2003, 2: 13-10.1186/1475-4924-2-13.

    Article  PubMed Central  PubMed  Google Scholar 

  12. Siddhartan R, Siggia ED, van Nimwegen E: PhyloGibbs: A Gibbs sampling motif finder that incorporates phylogeny. PLoS Comput Biol. 2005, 1: e67-10.1371/journal.pcbi.0010067.

    Article  Google Scholar 

  13. Andersson SA, Lagergren J: Motif Yggdrasil: Sampling sequence motifs from a tree mixture model. J Comput Biol. 2007, 14 (5): 682-697. 10.1089/cmb.2007.R010.

    Article  CAS  PubMed  Google Scholar 

  14. Wasserman WW, Fickett JW: Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol. 1998, 278: 167-181. 10.1006/jmbi.1998.1700.

    Article  CAS  PubMed  Google Scholar 

  15. Johansson O, Alkema W, Wasserman WW, Lagergren J: Identification of functional clusters of transcription factor binding motifs in genome sequences: the MSCAN algorithm. Bioinformatics. 2003, 19 (Suppl 1): i169-i176. 10.1093/bioinformatics/btg1021.

    Article  PubMed  Google Scholar 

  16. Aerts S, Van Loo P, Thijs G, Moreau Y, De Moor B: Computational detection of cis-regulatory modules. Bioinformatics. 2003, 19 (suppl 2): ii5-ii14.

    Article  PubMed  Google Scholar 

  17. Workman CT, Stormo GD: ANN-Spec: A method for discovering transcription factor binding sites with improved specificity. Pac Symp Biocomput. 2002, 5: 467-478.

    Google Scholar 

  18. Sinha S: Discriminative motifs. J Comput Biol. 2003, 10: 599-615. 10.1089/10665270360688219.

    Article  CAS  PubMed  Google Scholar 

  19. Smith AD, Sumazin P, Zhang MQ: Identifying tissue-selective transcription factor binding sites in vertebrate promoters. Proc Natl Acad Sci USA. 2005, 102: 1560-1565. 10.1073/pnas.0406123102.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  20. Bembom O, Keles S, van der Laan MJ: Supervised detection of conserved motifs in DNA sequences with Cosmo. Stat Appl Genet Mol Biol. 2007, 6: 8-

    Google Scholar 

  21. Chen TM, Lu CC, Li WH: Prediction of splice sites with dependency graphs and their expanded bayesian networks. Bioinformatics. 2004, 21: 471-482. 10.1093/bioinformatics/bti025.

    Article  PubMed  Google Scholar 

  22. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M, Rolfe PA, KT T, Lander ES, Gifford DK, Fraenkel E, Young RA: Transcriptional regulatory code of a eukaryotic genome. Nature. 2004, 431: 99-104. 10.1038/nature02800.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  23. Bailey TL, Elkan C: Unsupervised learning of multiple motif in biopolymers using expectation maximization. Machine Learning. 1995, 21: 51-80.

    Google Scholar 

  24. Liu J, Neuwald AF, Larence CE: Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J Am Stat Assoc. 1995, 90: 1156-1170. 10.2307/2291508.

    Article  Google Scholar 

  25. Neuwald AF, Liu JS, Lawrence CE: Gibbs motif sampling: Detection of bacterial outer membrane protein repeats. Protein Sci. 1995, 4: 1618-1632.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  26. Motif discovery results – Discovered motifs, version 24. [http://fraenkel.mit.edu/Harbison/release_v24/final_set/Final_Motifs/]

  27. MacIsaac KD, Wang T, Gordeon DB, Gifford DK, Stormo GD, Fraenkel E: An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics. 2006, 7: 113-10.1186/1471-2105-7-113.

    Article  PubMed Central  PubMed  Google Scholar 

  28. Barbaric S, Munsterkotter M, Svaren J, Horz W: The homeodomain protein Pho2 and the basic-helix-loop-helix protein Pho4 bind DNA cooperatively at the yeast PHO5 promoter. Nucleic Acids Res. 1996, 24: 4479-4486. 10.1093/nar/24.22.4479.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  29. MDscan: A fast and accurate motif finding algorithm with aApplications to chromatin immunoprecipitation microarray experiments. [http://ai.stanford.edu/~xsliu/MDscan/]

  30. Cosmo – Constrained search for motifs in DNA sequences. [http://cosmoweb.berkeley.edu/]

  31. Bi C, Rogan PK: BIPAD: A web server for modeling bipartite sequence elements. BMC Bioinformatics. 2006, 7: 76-10.1186/1471-2105-7-76.

    Article  PubMed Central  PubMed  Google Scholar 

  32. van Helden J, Rios AF, J CV: Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acid Res. 2000, 28: 1808-1818. 10.1093/nar/28.8.1808.

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

We thank Yi-Sian Liao for helping the execution of our prediction program. We also thank the reviewers for their valuable suggestions which improve the presentation of this paper. This work was supported by the National Science Council, Taiwan, under Contracts NSC 93-3112-B-007-004 and NSC95-3114-P-002-005-Y.

This article has been published as part of BMC Bioinformatics Volume 9 Supplement 12, 2008: Asia Pacific Bioinformatics Network (APBioNet) Seventh International Conference on Bioinformatics (InCoB2008). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S12.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Te-Ming Chen.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

CL and WY developed and implemented the method. All authors participated in discussions and writing of the paper.

Electronic supplementary material

12859_2008_2712_MOESM1_ESM.pdf

Additional file 1: Figure S1 – Predicted results of the constraint-less Cosmo program and the comparison with our program. (PDF 325 KB)

12859_2008_2712_MOESM2_ESM.pdf

Additional file 2: Table S1 – Parameters used in our program to give the best predicted motifs for the 36 transcription factors. (PDF 30 KB)

12859_2008_2712_MOESM3_ESM.pdf

Additional file 3: Figure S2 – Comparison of the predicted results with the primary and alternative samplers. (PDF 545 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Lu, CC., Yuan, WH. & Chen, TM. Extracting transcription factor binding sites from unaligned gene sequences with statistical models. BMC Bioinformatics 9 (Suppl 12), S7 (2008). https://doi.org/10.1186/1471-2105-9-S12-S7

Download citation

  • Published:

  • DOI: https://doi.org/10.1186/1471-2105-9-S12-S7

Keywords