Center for Studies in Physics and Biology, The Rockefeller University, New York, NY 10021, USA

School of Computer Science, McGill University, 3480 University Street, Montreal, QC, H3A 2A7, CANADA

Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350, USA

Abstract

Background

This paper addresses the problem of discovering transcription factor binding sites in

Results

We propose an algorithm that integrates two important aspects of a motif's significance –

Conclusions

The results demonstrate that the new approach improves motif discovery by exploiting multiple species information.

Background

The discovery of novel transcription factor binding sites in regulatory sequences of genes has been an important scientific challenge for some years now. Computational approaches to this problem have come in two flavors. One class of methods looks for overrepresented motifs in sequences that are believed to contain several binding sites for the same factor (such as promoters of co-regulated genes) **Phy**logenetic **M**otif **E**licitation), for

PhyME integrates two different axes of information in evaluating a candidate motif's significance. One axis is that of

An important feature of PhyME is that it allows motifs to occur in (evolutionarily) conserved as well as unconserved regions in orthologous promoters, treating the two kinds of occurrences differently when scoring a motif. It does not require each binding site occurrence in one promoter to have an orthologous occurrence in any or all other species. As a result, PhyME affords some flexibility in terms of the evolutionary distances spanned by the input sequences. For instance, using a distantly related ortholog will help pinpoint motifs located in conserved regions but will not hamper the discovery of motifs absent from that ortholog.

Comparison with previous work

Traditionally, motif finding algorithms have treated input sequences as being independently generated, and searched for statistically overrepresented motifs in them. These algorithms

Another class of motif-finding methods takes as input sets of orthologous sequences, either aligned

Some algorithms

There are algorithms that attempt to handle the two axes of information by a two-step approach. For instance, Cliften

A recent algorithm called orthoMEME (Prakash

Our approach is most similar to the algorithm PhyloGibbs (Siddharthan

The algorithm EMnEM (Moses

The algorithm PhyloCon (Wang and Stormo

Results

In this section, we first present the new algorithm, and then describe its evaluation on synthetic data, as well as biological data sets from various organisms.

Algorithm

Suppose that the input includes _{r }in the input, for which there is sequence data corresponding to each of the _{1}, _{2 },..., _{K}}, where _{i }is the orthologous sequence from species _{i}'s comes from species _{r}. The input also includes the motif length

PhyME first partially aligns the input sequences and identifies contiguous regions ("blocks") in each

Alignment of sequences

In this pre-processing step, PhyME computes the regions of high local similarity between _{i}. The assumption is that such regions are of common evolutionary origin, and any sequence outside them is independently evolved. PhyME runs the LAGAN alignment program of Brudno _{i}|_{r}), and extracts all ungapped aligned blocks of a certain minimum size (of the order of the motif's length) and percent identity, to serve as the blocks of common origin. This is illustrated in Figure _{1 }(the reference species), _{2 }and _{3}. An example of a block is region BC in _{1}, aligned with region UV in _{3}. Note how blocks can overlap in the reference species (BC overlaps with KL).
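The block-extraction idea can be illustrated with a minimal sketch. It assumes the pairwise alignment is available as two equal-length gapped strings, and it simply keeps every maximal gap-free run that passes the length and identity thresholds. The function name, the threshold values and the "maximal run" rule are illustrative assumptions, not PhyME's actual implementation or LAGAN's output format.

```python
def ungapped_blocks(ref_aln, other_aln, min_len=10, min_identity=0.7):
    """Return (start, end, identity) for gap-free runs of a pairwise alignment.

    start and end are half-open column indices into the aligned strings.
    Thresholds are hypothetical defaults chosen for illustration only.
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("aligned strings must have equal length")
    blocks = []
    start = None
    # A sentinel gap column appended at the end flushes the final run.
    for col, (a, b) in enumerate(zip(ref_aln + "-", other_aln + "-")):
        gapped = a == "-" or b == "-"
        if not gapped and start is None:
            start = col  # a new gap-free run begins here
        elif gapped and start is not None:
            length = col - start
            matches = sum(
                x == y for x, y in zip(ref_aln[start:col], other_aln[start:col])
            )
            identity = matches / length
            if length >= min_len and identity >= min_identity:
                blocks.append((start, col, identity))
            start = None
    return blocks


# Example: one long conserved run and two short ones around the gaps.
reference = "ACGTACGT--ACGTAC"
ortholog = "ACGTACGTTTAC-TAC"
long_blocks = ungapped_blocks(reference, ortholog, min_len=4, min_identity=0.9)
```

A stricter scheme could trim each run to its best-scoring sub-block; the sketch keeps whole runs for brevity.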

The input is now reorganized into two kinds of sequences:


Orthologous promoters and blocks of sequence conservation. Shaded areas represent ungapped aligned blocks. _{1 }is the reference species. (a) Alignment of input sequences and extraction of blocks. (b) Reorganization of input sequences.

1. The sequence from the reference species, with aligned blocks of the other species "hanging off" it. (In Figure _{2 }and blocks UV, WX of _{3 }aligned with corresponding blocks in _{1}.) Thus, any position in this sequence either has a single base from the reference species, or has an alignment of bases from multiple species, one of which is the reference species. This entire construct is called the "reference sequence".

2. Any subsequence not from the reference species, and bracketed by blocks on both sides. (e.g., regions NO, PQ, VW in Figure

PhyME fits the parameters of a probabilistic model on the reference and bracketed sequences simultaneously, and the desired motif comes out as a by-product of this training procedure, which is described next.

Hidden Markov Model

The probabilistic process that is assumed to generate sequences is described by a very simple _{m }of length _{b }of length 1. (The (^{th }entry of a weight matrix is the probability of emitting the base _{m }or _{b }according to their _{m }= _{b }= 1 - _{m}, _{b}, and _{T}_{b}) be the probability of generating _{b}. For a given

This log-likelihood ratio is the function optimized by PhyME – the parameters _{m }and _{b }is not trained during this maximization, rather it is pre-computed from

An important aspect of computing _{1}_{2 }... _{l}, and _{kj }is the probability of sampling base ^{th }position of _{1}_{2 }... _{l}, where each _{k }is either a single base, or an alignment of orthologous bases at a single position of the reference sequence. The subsequence probability _{e}(_{1}, _{2}, ... _{K}), where _{σ }is the nucleotide from species _{σ }were _{σ}'s occur in an alignment (_{e}(
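As a rough illustration of the generative model and its log-likelihood ratio, the following sketch handles only the single-species case, where every position holds one base. At each step the process emits either a full motif occurrence (with some transition probability) or a single background base (with the complementary probability), and the likelihood is summed over all parses by a forward recursion. All names here (`p_motif`, `consensus_matrix`, the dictionary layout of the matrices) are assumptions made for the example, and the aligned-column probabilities that PhyME uses are replaced by plain weight-matrix entries.

```python
import math

BASES = "ACGT"


def consensus_matrix(consensus, strength=0.97):
    """Hypothetical helper: a weight matrix peaked on a consensus string."""
    rest = (1.0 - strength) / 3.0
    return [{b: strength if b == c else rest for b in BASES} for c in consensus]


def log_likelihood_ratio(seq, motif, background, p_motif):
    """log Pr(seq | motif model) - log Pr(seq | background-only model)."""
    l = len(motif)
    p_bg = 1.0 - p_motif
    # forward[i] = total probability of all parses that emit exactly seq[:i]
    forward = [0.0] * (len(seq) + 1)
    forward[0] = 1.0
    for i in range(1, len(seq) + 1):
        # Case 1: the last step emitted one background base.
        total = forward[i - 1] * p_bg * background[seq[i - 1]]
        # Case 2: the last step emitted a motif occurrence ending at i.
        if i >= l:
            site = 1.0
            for k, base in enumerate(seq[i - l:i]):
                site *= motif[k][base]
            total += forward[i - l] * p_motif * site
        forward[i] = total
    # Background-only model: every step picks the background matrix.
    log_bg_only = sum(math.log(background[b]) for b in seq)
    return math.log(forward[-1]) - log_bg_only


uniform = {b: 0.25 for b in BASES}
score = log_likelihood_ratio("TTTTACGTTTTT", consensus_matrix("ACGT"), uniform, 0.1)
```

The plain products underflow on long sequences; a practical version would rescale `forward` at each step or work in log space.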

Evolutionary model

This section describes the probabilistic evolutionary model that PhyME uses to incorporate phylogenetic relationships in the computation of the term _{e}(

where _{σ }is the nucleotide from species _{xy }= 1 if _{σ }is the neutral mutation probability between the ancestor and the species _{kσ}, and each such base is either passed unchanged to the species _{σ}) or mutated in species _{σ }and a new base selected with a frequency defined again by

In the general case, when Ψ does not have a star topology, Formula (1) can be written in a recursive manner. (See Methods.)
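For the star-topology case, the prose description above (an ancestral base drawn from the weight-matrix column, then either passed unchanged to each species or replaced by a fresh draw from the same column) can be sketched as follows. The formula itself is not reproduced in this text, so the expression in the code is a reading of that description rather than a verbatim copy; the names `w_k` and `mu` are placeholders for the weight-matrix column and the per-species neutral mutation probabilities.

```python
def star_column_probability(column, w_k, mu):
    """Probability of one aligned column under a star phylogeny (sketch).

    column : observed bases, one per species (string or list)
    w_k    : dict base -> probability, column k of the weight matrix
    mu     : neutral mutation probabilities, one per species
    """
    total = 0.0
    for ancestor, p_ancestor in w_k.items():
        prob = p_ancestor  # ancestral base is sampled from the column
        for base, m in zip(column, mu):
            # Either the ancestral base is kept (only possible if it matches),
            # or it mutates and a new base is drawn from the same column.
            unchanged = (1.0 - m) if base == ancestor else 0.0
            prob *= unchanged + m * w_k[base]
        total += prob
    return total


w = {"A": 0.4, "C": 0.3, "G": 0.2, "T": 0.1}
p_conserved = star_column_probability("AA", w, [0.2, 0.3])
p_diverged = star_column_probability("AC", w, [0.2, 0.3])
```

With a single species the expression collapses to the weight-matrix entry itself, and with all mutation probabilities equal to 1 the species become independent draws from the column.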

Expectation Maximization

The function _{m},

Let _{i }in generating the sequence(s), the expectation being over all parses. Similarly, let ^{th }position of the motif _{m}.

In the "M-step", two kinds of updates are made, using the values of _{m }is updated by solving, for each column _{β }(

The derivation of the update formulas is somewhat involved, and is described in the Methods section. The equations are solved using Newton's method, and the solution value of _{β }is used to update the (^{th }entry of _{m}, according to _{mkβ }= ^{uβ}. Newton's method involves computation of the first and second partial derivatives of log _{e}(_{m, k}), as described in Methods. In practice, we found that Newton's method always converges from a single initial condition, and the convergence almost always happens within 3–5 iterations.

The time complexity of (each E-M iteration in) PhyME is O(

Results on synthetic data

We first present the results of running PhyME on synthetic data. The experimental framework is largely borrowed from Wang and Stormo _{b }along each branch. (No insertions or deletions were included in this simulated evolution, for simplicity.) The motif instances are subjected to a common "motif mutation rate" _{m }, which is the probability of mutation of any position in a motif. The ancestral set of sequences is then removed and the remaining _{b}. MEME and GIBBS were run on the entire data set pooled together, ignoring the orthology of sequences.

Figure


Effect of varying the number of species (_{b }= 0.3, _{m }= 0.1.)

Figure _{b }was varied from 0.2 to 0.5 and the motif mutation rate _{m }was varied between 0.1 and _{b }- 0.1). As expected, the performance of each algorithm improved with decreasing _{m }(for a fixed _{b}). Interestingly, as _{b }decreases, the performance of PhyME for _{m }= 0.1 first improves and then declines. The initial improvement arises because the alignment step finds more conserved blocks as the background mutation rate diminishes. However, as the background mutation rate approaches the motif mutation rate, the distinction (in cross-species conservation) between motif and background becomes weaker, and performance drops. In another set of experiments, we examined the effect of the alignment step on the performance. Sequences were created with _{b }= 0.3, _{m }= 0.1 and


Effect of varying background and motif mutation rates (_{b }and _{m }respectively) on motif-finding performance. Each point is an average over 10 experiments with synthetic data. (


Effect of the alignment step on motif-finding performance. The x-axis shows how many of the orthologous pairs of planted motifs are artificially unpaired in the alignment step. Each solid line represents a separate experiment. The squares plot the average score over eight experiments.

We also evaluated the effect of mis-estimates of the neutral mutation rates on performance. PhyME was run on random sequences created with experiment parameters _{b }= 0.3, _{m }= 0.1 and _{b }input to PhyME ranged from 0.1 (underestimate) to 0.5 (overestimate). We observed that underestimates of _{b }resulted in significantly greater performance degradation than overestimates of equal magnitude. (Data not shown.) For instance, using _{b }= 0.4 instead of the true value of 0.3 made no difference to the performance, whereas using _{b }= 0.2 resulted in a 15–50% decline.

Results on biological data

In the following sections, we present results of running PhyME on real data sets from yeast, fly and human. The results are compared to MEME (run by pooling orthologous sequences together), orthoMEME

Yeast data sets

We first present some examples in yeast, where sequence data from four species,

See Methods for details on how orthoMEME, PhyloGibbs and EMnEM were run.

Figure


Effect of multiple species information on motif-discovery in the regulons RAP1, MIG1, CAR1, PHO4 and MCM1 in yeast. The y-axis plots the number of matches to the known motif, among the top

Figure

We find the scores of orthoMEME, as reported in Figure

We suggest caution in comparing PhyME's scores directly with those of the other programs. We have more experience using PhyME than the other programs, and this biases the comparisons. The concern is greatest for EMnEM, which has several parameters for modeling the evolution of motifs that we may not have set optimally. Our goal in these experiments was to provide some examples of how multiple species data can be exploited to improve performance, rather than to assess the different motif finding programs available. A proper comparative assessment of these programs has to address several challenges not addressed here. Such a task was undertaken for several motif finding programs, in the work of Tompa

Fly data sets

Next, we present results from fruitfly, where data from two species,

Well-defined weight matrices are available for both

Figure


Comparison of PhyME to 1 species and 2 species MEME, and to PhyloGibbs and EMnEM, for fly enhancers. The parenthetical number next to an enhancer name is the number of strong occurrences of the known weight matrix, in the

As in the yeast data sets, the comparison of scores between PhyME and the other programs should be interpreted with caution, since we lacked expertise in choosing optimal parameter settings for the other programs.

Human data sets

Finally, we present results of running PhyME on two data sets from human, where orthology with mouse and rat was utilized. These data sets were chosen because all of 15 different motif-finding programs tested in an assessment project (Tompa


Results on the human SP1 regulon. (a) The known motif. (b) Motif reported by PhyME, using mouse and rat orthologs. (c) The phylogenetic tree used by PhyME.

The second data set used in our tests corresponds to the leucine zipper transcription factor c-Jun. The heterogeneous data set included 500 bp promoters for 11 human genes targeted by c-Jun, as well as orthologs from mouse and rat for 3 genes, from mouse only for 5 genes, and from rat only for the remaining three genes. PhyME was run exactly as in the previous data set. The known binding sites of c-Jun (in human) were aligned to produce a weight matrix that is shown in Figure


Results on the human c-Jun regulon. (a) The known motif. (b) Motif reported by PhyME, using mouse and rat orthologs.

Discussion

Issues in algorithm design

Alignment step

In the alignment step, PhyME extracts blocks of high sequence similarity between the reference species and each of the other species. Motif occurrences in such locally conserved regions are deemed orthologous, an assumption well-justified by traditional interpretations of sequence alignment. Conversely, all orthologous motif occurrences are assumed to be aligned in such blocks. This assumption is not always true since there may be orthologous motif occurrences not aligned by the alignment program, but it heavily constrains the space of orthologous motif occurrences, implying greater efficiency of the search algorithm. Moreover, the assumption does not mean that "true" orthologous occurrences in unaligned regions are ignored – they are merely treated as independent occurrences. Our experiments on synthetic data (see Results) demonstrate that the performance is not very sensitive to the correct alignment of all orthologous motif pairs. The blocks computed in the alignment step have to be with respect to the reference species, but the alignment itself need not be done in a pairwise manner. A multiple alignment of all sequences may be computed (e.g., with M-LAGAN _{i }may then be extracted. (The alignment step is implemented as a separate tool in PhyME, making it easy to switch to such alternative schemes.) Furthermore, the implementation may be modified in the future to drop the requirement of a reference species, since this requirement is not crucial to the motif finding step of PhyME. For instance, the alignment step may utilize the "Threaded Block Alignment" (TBA) program of Blanchette

Once the blocks of high sequence conservation have been identified, a possible strategy is to restrict attention to motif occurrences in these blocks, assuming that all functional binding sites must be evolutionarily conserved. However, this assumption is not true even for species as closely related as

Motif Finding

In the probabilistic process that is assumed to generate sequences, the transition probability does not depend on the previous choice(s) made during the process, meaning that the HMM is of zeroth order, nor on the position in the sequence, meaning that any information about spatial distribution of motifs is ignored. The model, unlike that of MEME, does not fragment the sequence into all

The evolutionary model described by Formula 1 applies only to phylogenies having a star topology. The general case of arbitrary tree topology is described in Methods. In Formula 1, if _{σ }is small (as for very closely related species), then finding different bases in orthologous positions has low probability _{e}(_{σ}~1, ∀ _{e}(

The neutral mutation rates (probabilities) along each branch of Ψ are input by the user and not trained during E-M. Training them on input data may cause overfitting, producing values that are largely inconsistent with the known evolutionary distances. The work of Moses

Note that the evolutionary model used by PhyME comes into play only in Equations 2 as the term _{e}(_{m}, _{e}(_{m},

Conclusions

We have developed a new algorithm, PhyME, that detects motifs in heterogeneous sequence data by integrating two important aspects of a motif's significance – overrepresentation and cross-species comparison – into one probabilistic score. We have evaluated different aspects of the algorithm on synthetic data, and demonstrated on some biological data sets that the new approach improves motif detection.

Methods

The evolutionary model

The evolutionary model makes the following assumptions: (i) Nucleotides in an aligned position are evolved from a common ancestor. (ii) The weight matrix applies to the common ancestor and to all descendants, a reasonable assumption given the propensity of DNA binding domains of proteins to evolve slower than cis-regulatory modules. (iii) All positions evolve independently, at equal rates, and the probability of fixation of a mutation _{1}, _{2}, .... _{K}} at the leaves. Let the vector _{1}, _{2}, ... _{K}), where _{σ }is the nucleotide from species _{e}(_{j }be the probability of a base in the parent species of _{j }be the vector formed by elements of

where _{j}, _{j }given that the base at the parent of _{ij }= 1 if

For the special case where Ψ has a star topology, Equation 4 reduces to Equation 1.
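The recursive computation over a general tree can be sketched with a standard bottom-up pass. The sketch assumes the per-branch substitution rule described for the star case (keep the parent base with probability one minus the branch's mutation probability, otherwise redraw from the weight-matrix column), since Equations 4 and 5 are not reproduced in this text. The dictionary-based tree encoding and the function names are illustrative assumptions.

```python
BASES = "ACGT"


def conditional(node, w_k):
    """Map each base a to Pr(observed leaves below `node` | `node` carries a)."""
    if "base" in node:  # leaf: indicator of the observed base
        return {a: 1.0 if a == node["base"] else 0.0 for a in BASES}
    result = {a: 1.0 for a in BASES}
    for child in node["children"]:
        below = conditional(child, w_k)
        mu = child["mu"]  # mutation probability on the branch leading to child
        # Contribution of the "mutated, redrawn from w_k" case; it does not
        # depend on the parent base, so it is computed once per child.
        redrawn = sum(w_k[b] * below[b] for b in BASES)
        for a in BASES:
            result[a] *= (1.0 - mu) * below[a] + mu * redrawn
    return result


def tree_column_probability(root, w_k):
    """Probability of an aligned column, with the root base drawn from w_k."""
    at_root = conditional(root, w_k)
    return sum(w_k[a] * at_root[a] for a in BASES)


w = {"A": 0.4, "C": 0.3, "G": 0.2, "T": 0.1}
tree = {
    "children": [
        {"base": "A", "mu": 0.2},
        {"mu": 0.1, "children": [{"base": "A", "mu": 0.3}, {"base": "C", "mu": 0.4}]},
    ]
}
p_column = tree_column_probability(tree, w)
```

Each node is visited once, so the cost per column grows linearly with the number of species when node degree is bounded. When the root's children are all leaves, the result matches the star-topology expression.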

Training parameters in a HMM

Given a sequence _{i}}, the objective function to be maximized is _{b})), where _{b }represents the parameter values that only allow the background motif _{b }to be used by the HMM. The sequence _{1}_{2 }... _{L}, where each _{i }is either a single base or an alignment of orthologous bases at a single position. _{i }and their transition probabilities _{i}. Since _{b}) depends only on _{b}, which is assumed constant, we shall outline how to maximize log _{i }(i.e., the series of motifs chosen in the successive steps of the generative probabilistic process) is denoted by

We thus have

The maximization is iterative, with the ^{th }iteration computing a model ^{t + 1 }that improves the objective function from the current model ^{t}. In classical E-M fashion, let us define a function ^{t}) as

It is easily shown that log ^{t}) ≥ ^{t}) - ^{t}|^{t}). Thus, if we maximize ^{t}) over all ^{t}), or remain there if the local maximum has been reached. Let _{i}(_{i }occurs in the parse _{ikψ}(^{th }position of the matrix _{i}, in parse _{i }denote the length of _{i}. Then we have

which gives us

Note that only the first term in this expression depends on _{i}, and only the second term depends on _{i}. Hence, we maximize each of these terms independently, with respect to the appropriate free parameters. We first maximize the term

Note that _{i }in

Thus the term to maximize is

Next, we maximize the second term:

Again, note that ^{th }position of the matrix _{i }while generating _{m }to be trained. Hence, we need to maximize _{m}. We can do this maximization with respect to each column _{kβ }denote the (^{th }entry of _{m}. Thus, for each _{kβ }(_{β }_{kβ }= 1. Using Lagrangian multiplier _{β }_{kβ }- 1).

Transforming to log variables _{β }= log _{kβ }to ensure that the _{kβ }remain positive during optimization, we then have the following necessary conditions for optimality (in addition to the constraint

We therefore have a system of five equations (including the constraint) in the variables _{β }(∀**u**, we solve this system of equations using Newton's iterative method. Let us write the above system of equations as **F**(**u**) = **0**, where **F**(**u**) = [[_{β}], _{λ}], with _{β }being the left side of Equation 9, and

$$\Delta\mathbf{u} = -\left(\mathbf{J}(\mathbf{u})\right)^{-1}\,\mathbf{F}(\mathbf{u})$$

where Δ**u **is the change in **u **in the current iteration and **J **is the Jacobian matrix of **F**. The important terms in the computation of **F **and **J **are the first and second partial derivatives of log _{e}(_{m, k}) with respect to the _{β }variables. For this purpose, we need to compute _{e}(_{m, k}) and its first and second partial derivatives. Computation of _{e}(_{m, k}) uses the formulas 4 and 5. The partial derivatives can be computed recursively (over the tree Ψ) by using the chain rule of differentiation. These recursive computations are implemented in a bottom-up manner, so as to avoid redundant computations. Newton's method uses **F **and **J **to iteratively compute new values of **u**, until convergence. The Jacobian matrix **J **in our case is not positive definite, hence Newton's method is not guaranteed to converge. However, in practice, we found the method to always converge from a single initial seed. Upon convergence, the log variables _{β }are transformed back to _{kβ }= ^{uβ}. The procedure is repeated for each _{m }is then updated with the new values. This update, along with that given by Equation 8, is used iteratively to improve
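The Newton update on log variables with a Lagrange multiplier can be illustrated on a deliberately simplified stand-in problem. The sketch assumes a single species, so that the log-probability of a column entry is just the log variable itself, and maximizes a weighted sum of log-probabilities subject to the entries summing to one. In this special case the optimum is known in closed form (each entry proportional to its expected count), which makes the iteration easy to check. The function names and the starting point are assumptions for the example; the phylogenetic derivatives used by PhyME are not reproduced here.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def newton_column_update(counts, iterations=20, tol=1e-12):
    """Newton's method on (u, lambda) for one weight-matrix column (sketch)."""
    n = len(counts)
    u = [math.log(1.0 / n)] * n   # log variables, started at the uniform column
    lam = float(sum(counts))      # initial guess for the Lagrange multiplier
    for _ in range(iterations):
        w = [math.exp(x) for x in u]
        # F = [count - lam * exp(u) for each base] followed by the constraint.
        f = [c - lam * wi for c, wi in zip(counts, w)] + [sum(w) - 1.0]
        if max(abs(v) for v in f) < tol:
            break
        jac = [[0.0] * (n + 1) for _ in range(n + 1)]
        for i in range(n):
            jac[i][i] = -lam * w[i]   # derivative of F_i with respect to u_i
            jac[i][n] = -w[i]         # derivative of F_i with respect to lambda
            jac[n][i] = w[i]          # derivative of the constraint w.r.t. u_i
        delta = solve_linear(jac, [-v for v in f])
        u = [x + d for x, d in zip(u, delta[:n])]
        lam += delta[n]
    return [math.exp(x) for x in u]   # transform back from log variables


column = newton_column_update([6.0, 2.0, 1.0, 1.0])
```

Working in log variables keeps every entry positive throughout the iteration, which is the motivation given above for the transformation.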

Time complexity

The E-step computes _{m}. (This time complexity assumes that nodes in the phylogenetic tree have a fixed maximum degree.) Thereafter,

The M-step runs Newton's method to solve a system of equations, once for each column of _{m}. Each run of Newton's method goes through a small number (3–5) of iterations. Each iteration computes the first and second partial derivatives of log _{e}(_{m, k}) Each of these derivatives can be computed in **F **and **J **can be computed in

Thus, the running time of (each E-M iteration in) PhyME scales linearly with the length of the sequences, the length of the motif desired, and the number of species.

Implementation details

PhyME is implemented in C++ for Linux, and the source code will be made freely available at

PhyME uses the LAGAN alignment tool of Brudno

The E-M algorithm is guaranteed to converge only to a local optimum. To address this problem, the motif-finding step is executed a fixed number of times, each time using a randomly chosen substring of the input sequence as the "seed" to initialize _{m}, and truncating the E-M procedure after a small number of iterations. The seed with greatest score

PhyME considers occurrences on both strands by introducing a new weight matrix _{r}, and an associated transition probability _{r}, in the HMM parameters. The weight matrix is constrained to be the reverse complement of _{m}. The model has a fixed bias of planting the motif in one orientation versus the other, and this bias is trained from the data. PhyME also has the option of capturing local correlations in background nucleotide composition. To implement a ^{th }order Markov background, PhyME uses a special background weight matrix that is of length 1 but uses the knowledge of the previous

Performance score in experiments with synthetic data

We use the following score for measuring the performance of a motif-finding algorithm on synthetic data. Let _{1}, _{2}, ... _{n}} be the set of _{mi }be the set of positions in sequence _{i }that are occupied by an occurrence of ^{k}, and are evaluating the motif ^{r }reported by an algorithm. The performance score Φ is defined as follows:

In other words, it is the number of positions, over all sequences, where occurrences of the known and reported motifs overlap, divided by the total number of positions at which the known

Details of experiments with biological data sets

Yeast

The genes regulated by each transcription factor are listed in SCPD. For each such "regulon", the known sites and the known weight matrix were extracted from SCPD. Also, 800 bp long upstream sequences of the genes in each regulon were extracted (for

PhyME was run with the ^{rd }order Markov background ("-N 3") trained on the full complement of yeast promoters was used, as with PhyME and MEME. The "loose align" option ("-D 1") and the "stop after anneal" option ("-X") were used. These options were suggested by an author of PhyloGibbs (Rahul Siddharthan, personal communication). We experimented with a different value for the mutation probability ("-G 0.7"), with no improvement, except in the RAP1 regulon. EMnEM was run with default parameters, the motif length being input through the "-w" parameter. Phylogenetic trees were derived from each input promoter, using the fastDNAML software of Olsen

Fly

The locations of cis-regulatory modules involved in body-patterning of the early embryo in

OrthoMEME was run as in the yeast data sets (see above), except that the "tcm" mode was used now. PhyloGibbs was also run as in the yeast data sets, except that we used a mutation probability of 0.5 ("-G 0.5"), and a 2^{nd }order Markov background ("-N 2"), trained on non-coding regions in fly. We also experimented with a higher value of the mutation probability, and tried specifying the initial number of occurrences per motif ("-I") differently, with no clear improvement. EMnEM was run with the Jukes-Cantor evolutionary model ("-m 0") and with the relative rate of motif to background ("-p") set to 0.5 and 0.25 in separate runs. The expected number of motifs was set to

Human

The genes comprising each regulon were obtained from TRANSFAC

Authors' contributions

All authors participated in initial discussions leading to the key idea of using Expectation-Maximization and a phylogenetic model to search in a weight-matrix space. SS designed the algorithm details, derived the E-M calculations, implemented and tested the program, and drafted the manuscript. All authors contributed to, read and approved the final manuscript.

Acknowledgments

This material is based upon work supported in part by the National Science Foundation under grant DBI-0218798, in part by the National Institutes of Health under grant R01 HG02602, and in part by a Keck Foundation fellowship. We are very grateful to Amol Prakash for experiments with orthoMEME, and to Rahul Siddharthan for help in running PhyloGibbs. Several useful discussions on the topic with Eric Siggia are also acknowledged. An anonymous referee, who suggested several useful changes to the manuscript, is also thanked.