Department of Cancer Pharmacology, Millennium Pharmaceuticals Inc., Cambridge, MA, USA
Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA
Department of Biological Sciences, Dartmouth College, Hanover, NH, USA
Abstract
Background
Despite the diversity of motif representations and search algorithms, the
Results
We present a novel ensemble learning method, SCOPE, that is based on the assumption that transcription factor binding sites belong to one of three broad classes of motifs: non-degenerate, degenerate and gapped motifs. SCOPE employs a unified scoring metric to combine the results from three motif finding algorithms, each aimed at the discovery of one of these classes of motifs. We found that SCOPE's performance on 78 experimentally characterized regulons from four species was a substantial and statistically significant improvement over that of its component algorithms. SCOPE outperformed a broad range of existing motif discovery algorithms on the same dataset by a statistically significant margin.
Conclusion
SCOPE demonstrates that combining multiple, focused motif discovery algorithms can provide a significant gain in performance. By building on components that efficiently search for motifs without user-defined parameters, SCOPE requires as input only a set of upstream sequences and a species designation, making it a practical choice for non-expert users. A user-friendly web interface, Java source code and executables are available at
Background
The computational discovery of DNA binding sites for previously uncharacterized transcription factors in groups of co-regulated genes is a well-studied problem with a great deal of practical relevance to the biologist, since such binding sites provide targets for mutational analyses (for reviews see
The position-specific variability of transcription factor binding sites makes their
Motif finding programs rely on a search algorithm to optimize a motif model (an abstract representation of a set of transcription factor binding sites). Most recent programs represent motifs as position weight matrices (PWMs), which record the frequency of each base at every position in the motif. Other motif finding programs have relied on the use of consensus motif models (in which every base is represented by a letter of the 15-letter IUPAC code, which accounts for degeneracies as well as single bases) or
Program parameters (such as motif length, number of occurrences and orientation) that cannot be reasonably specified by the user without prior knowledge about the true binding sites are referred to as nuisance parameters
Nuisance parameters complicate the interpretation of performance comparisons as well. A recent large-scale performance comparison between thirteen different motif finding tools used expert knowledge in setting the parameters for every program
A key result of the Tompa,
Ensemble methods, well known in the machine learning community
To the best of our knowledge, only one study to date has explored ensemble learning in motif finding. Hu, Li and Kihara
Here we present a novel ensemble motif finder based on a different conceptual approach. Rather than randomly restarting the same search algorithm or comparing multiple search strategies that all search for the same global optimum (and are potentially vulnerable to the same local optima), our algorithm assumes that the "biological significance surface" primarily consists of three local optima, and that one of these peaks represents the global optimum. Thus, our ensemble uses three specialized algorithms whose search spaces restrict them to each of these three local optima (BEAM for non-degenerate motifs, PRISM for degenerate motifs and SPACER for bipartite motifs). We have previously demonstrated that the greedy search strategies employed by each of these methods allow them to reliably search their respective motif domains without the use of nuisance parameters, as the algorithms themselves efficiently optimize the parameters that are typically forced on the users
The results of these component algorithms are then combined using a learning rule that is simply the maximum over the scores returned by the component algorithms. To make these scores comparable, the motif scores returned by each algorithm are penalized according to the complexity of the motif. The resulting ensemble algorithm, SCOPE, has no nuisance parameters and performs significantly better than its component algorithms. In addition, we find that SCOPE compares favorably with a diverse range of existing methods and is robust to the presence of extraneous sequences in its input.
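For illustration, this maximum-score learning rule can be sketched in a few lines. The function name and data layout below are hypothetical (SCOPE itself is implemented in Java), and the scores are assumed to already carry the complexity penalty described above:

```python
# Illustrative sketch of a max-rule ensemble; names and data layout are
# hypothetical, not SCOPE's actual implementation.

def ensemble_best(candidates):
    """Return the (algorithm, motif, score) triple with the single
    highest complexity-penalized score across all components.

    `candidates` maps an algorithm name to a list of
    (motif, penalized_score) pairs; higher scores are better."""
    best = (None, None, float("-inf"))
    for algo, motifs in candidates.items():
        for motif, score in motifs:
            if score > best[2]:
                best = (algo, motif, score)
    return best

# Toy usage: each component reports its top-scoring motifs.
result = ensemble_best({
    "BEAM":   [("TGACTC", 12.1)],      # non-degenerate
    "PRISM":  [("TGASTM", 14.9)],      # degenerate (IUPAC letters)
    "SPACER": [("TGANNNNNNTC", 9.3)],  # gapped/bipartite
})
```

Because the rule is winner-take-all, the ensemble's output on each regulon is exactly one component's motif; the penalty is what makes scores from different motif classes comparable.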
Results
Algorithm
SCOPE takes as input a set of sequences
SCOPE has three component algorithms, BEAM, PRISM and SPACER, which search for non-degenerate, short degenerate, and long, highly degenerate and "gapped" motifs, respectively (Figure
Flow diagram for SCOPE. BEAM and SPACER are run independently; PRISM runs on the top 100 motifs output by BEAM. For yeast (whose upstream regions are standardized to 800 bp), BEAM and PRISM use the overrepresentation-KS objective function (so/ks), while SPACER's longer running time requires the simpler overrepresentation objective function (so). The top 5 motifs from SPACER are rescored using the combined objective function. For bacteria and
Details of the algorithms, data sets and statistical analyses. This file contains the details needed to replicate the experiments and the statistical analyses, as well as an overview of the component algorithms.
Click here for file
Each of SCOPE's three component algorithms seeks to maximize the same objective function over a different class of motifs. Let
Thus, if
where 
Testing
Evaluation of objective functions used by SCOPE
Each component algorithm in SCOPE efficiently searches its restricted search space, keeping SCOPE's runtime low (average runtime on our datasets was about one minute). This efficiency allowed us to explore several objective functions for scoring the statistical significance
To establish which objective function (or combination of functions) was most suitable, we tested each objective function independently of SCOPE, using a subset of the
Correlation between accuracy and
These plots demonstrate that overrepresentation is a closer approximation to biological relevance than coverage or KS alone. Adding KS to overrepresentation modestly improved the correlation by 13% (as compared to overrepresentation alone) to R² = 0.28. To assess the degree of class separation achieved by the two objective functions, we ranked the sampled six-mers by
This analysis suggests that more complex objective functions may provide a better estimate of biological significance than the overrepresentation objective functions commonly used. We thus chose to run SCOPE using the overrepresentation-KS combined objective function on the
The surprisingly low correlations between
Evaluation of SCOPE performance and ensemble learning scheme
We first assessed the performance of the optimized SCOPE framework on synthetic datasets (for details, see Additional file
Performance at different overrepresentation
While synthetic test sets are useful in algorithmic development and initial testing, the results of such tests must be taken with a grain of salt, as they are highly dependent on the model used to generate the test sets
SCOPE's reported accuracy was significantly higher than any of its component algorithms (Table
Summary results for performance comparisons between SCOPE and its component algorithms, on all regulons. A "Win" is a regulon for which a program had the highest accuracy and that accuracy was at least 0.10. Programs in a two-way tie are credited with 0.5 wins each, so by construction, SCOPE can at best share a win with one of the other programs. A perfect winner-take-all ensemble method would have the same number of wins as all the component algorithms combined. A "clear win (loss)" is a regulon for which SCOPE's accuracy was at least 0.10 higher (lower) than the other program. The p-value reported for the paired t-test was Bonferroni-corrected to account for multiple (three) comparisons.
                          SCOPE   BEAM    PRISM   SPACER
Average                   0.24    0.17    0.18    0.17
Standard error            0.02    0.02    0.02    0.02
Wins                      20      13      11      17
Scores ≥ 0.50             8       8       6       5
Scores ≥ 0.33             21      15      14      14
Scores ≥ 0.20             39      23      23      26
Regulons returned         78      78      78      78
Clear win for SCOPE vs.   –       28      18      19
Clear loss for SCOPE vs.  –       6       2       3
t-test p-value            –       0.002   0.002   0.004
Average and standard error of sensitivity and PPV for the component algorithms of SCOPE on all 78 regulons. Bars represent standard error.
An ensemble motif finder with a learning rule that is no better than random will provide an accuracy that is equal to the average of its three component algorithms. To provide a basis for evaluating the performance of SCOPE's learning rule, we constructed an ensemble learning method (referred to here as BASELINE) from the results of BEAM, PRISM and SPACER, by randomly selecting one of the accuracies from these three programs for each regulon. Over 120,000 trials, BASELINE's average performance on this dataset was 0.176 with a standard deviation of 0.013. BASELINE's average score never exceeded that of SCOPE (
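The BASELINE null ensemble is straightforward to simulate; the sketch below uses made-up per-regulon accuracies rather than the paper's actual data:

```python
import random

def baseline_trial(accuracies, rng):
    """One BASELINE trial: for each regulon, pick one of the component
    algorithms' accuracies uniformly at random, then average over all
    regulons.

    `accuracies` is a list of (beam, prism, spacer) accuracy triples,
    one per regulon (toy values here, not the paper's data)."""
    picks = [rng.choice(triple) for triple in accuracies]
    return sum(picks) / len(picks)

# Hypothetical accuracies for three regulons.
toy = [(0.10, 0.30, 0.20), (0.40, 0.10, 0.10), (0.20, 0.20, 0.50)]
rng = random.Random(0)
trials = [baseline_trial(toy, rng) for _ in range(120_000)]
mean = sum(trials) / len(trials)
# By construction, BASELINE's expected score is the grand mean of all
# component accuracies (here, 7/30 ≈ 0.233).
```

This makes concrete why a random learning rule converges to the average of the components: each regulon's contribution is an unweighted mixture of the three accuracies.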
Of course, SCOPE's learning rule is extremely simple, and more complex learning rules may allow SCOPE to approach its theoretical upper bound. One rule that may prove effective is to weight the output of each algorithm, either according to (for example) the frequency of occurrence of each class of motif (non-degenerate, short degenerate or long degenerate) in the species, or by learning the appropriate weights on a representative training set, creating, in effect, a Naïve Bayesian Network. The training of a more complex learning rule must, however, be performed in a cross-validation framework, and the size of the available dataset of regulons will place a practical limit on the complexity of the learning rule that can be devised.
Comparison with other motif finding programs
To provide a frame of reference for SCOPE's performance, we ran ten other popular motif finders on these datasets (for details and references see Table
Motif discovery algorithms used in the performance comparison. Nuisance parameters are parameters that cannot be precisely defined without knowledge of the true binding sites (such as motif length, number of occurrences and orientation). For MotifSampler and wConsensus, the lower part of the range indicates required parameters, while the upper part indicates the total number of parameters, including "power user" parameters that the program authors stress should typically be left as default. Motif model abbreviations: cons = consensus; PWM = position weight matrix; mis = consensus with predefined number of allowed non-position-specific mismatches.
Program | # Nuisance Parameters | Motif Model | Search Strategy | Citation
Oligo analysis (RSAT) | 3 | cons | Exhaustive enumeration of short and bipartite oligos. Clusters overlapping motifs. Uses a binomial approximation to the hypergeometric score, similar to the overrepresentation objective function. | [14, 33, 34]
Yeast Motif Finder (YMF) | 2 | cons | Exhaustive enumeration of short and bipartite oligos. Alphabet is {ACGTYR}. Uses the Normal approximation to the hypergeometric function, similar to the overrepresentation objective function. | [35]
AlignAce (AA) | 2 | PWM | Gibbs sampling to optimize a Maximum a Posteriori (MAP) score. | [36]
MotifSampler (MS) | 3–5 | PWM | Gibbs sampling with higher-order Markov model. | [37]
BioProspector (Biopros) | 7 | PWM | Gibbs sampling with higher-order Markov model. Designed for long and bipartite motifs common in prokaryotes. | [16, 38]
MEME | 4 | PWM | Expectation Maximization over a modified information content. | [39]
Improbizer (Imp) | 8 | PWM | Expectation Maximization. Uses 2nd-order Markov model and optionally accounts for positional restrictions using a Gaussian model. | [40]
MITRA | 1 | mis | Tree-based search for long bipartite motifs with many mismatches. Uses a hypergeometric score similar to the overrepresentation objective function. | [41]
wConsensus (wCons) | 1–13 | PWM | Greedy enumeration to maximize information content. Infers motif length. | [42]
Weeder | 4 | mis | Bounded enumeration using a suffix tree. Tries all motif lengths from 6–12. | [43]
SCOPE has no user-adjustable parameters, although its component algorithms do contain a number of internal parameters ("hyperparameters") that govern their search over common nuisance parameters. On synthetic datasets, we found SCOPE's component algorithms to be quite robust to the settings of these hyperparameters. We have therefore fixed those parameters to reasonable values and do not expose them to the user
We compared the motif finding programs using the criteria set forth in Sinha and Tompa, including average accuracy and the number of total wins (highest accuracy on a regulon, where that accuracy is at least 0.1)
Performance comparisons. (a) Mean and standard error of accuracy for each of 78 regulons. (b) Cumulative distribution of accuracy for each program. (c) Fraction of regulons with a clear outcome (margin of difference in accuracy between two programs was greater than 0.10) won by SCOPE. Program abbreviations and details in Table 2; performance details in tables S1 and S2 in Additional file
A formal statistical analysis found that SCOPE's performance margin over the other motif finders run on this dataset was statistically significant at p < 10⁻⁵ (for details, see Additional file
SCOPE's high accuracy was a reflection of both high PPV and high sensitivity (Figure
(a) Average and standard error of sensitivity and PPV for each program on all 78 regulons. In cases where the program failed to return a result, the sensitivity is 0 and the PPV is undefined. Cases where a program did not support the species were not included. (b) Ranks on this plot were computed by taking the average of sensitivity and PPV ranks for each program.
Performance in the presence of extraneous upstream sequences
In practice, microarray co-expression data are often used to identify genes in a particular regulon. This approach identifies genes that are either directly or indirectly regulated by the transcription factor of interest. Therefore, sets of genes identified from co-expression data may often contain multiple extraneous upstream sequences. Adding sequences that do not contain binding sites decreases the signal-to-noise ratio, making motif finding more difficult
We thus tested SCOPE's performance on regulons containing additional extraneous upstream sequences. For all 33 regulons in the SCPD dataset, we added randomly selected upstream
Robustness of SCOPE performance on
Discussion
The field of motif finding is saturated with a large number of algorithms representing myriad search strategies, objective functions and motif models. Yet remarkably, performance comparisons consistently reveal disappointing performance for motif finders and fail to find any statistically significant differences among them. A brief survey of the per-regulon results of these performance comparisons yields two key observations: (1) there are many regulons for which a large number of programs find a small portion of the binding sites (though not necessarily the same portion); and (2) every program has a respectable number of "wins" (i.e. every program is the best existing program on some handful of regulons
Such observations are common in many machine learning applications, and are the direct result of complex search spaces that force restrictions on either the search strategy or the representation of the solution space (in this case, the motif model used to represent the motifs). For example, YMF and RSAT are guaranteed to find the optimal solutions in their motif space (fixed-length motifs with limited degeneracies), but that space is limited to the point that optimality provides no clear advantage over the other methods. Conversely, the PWM-based methods have an apparently more powerful motif model
The HLK ensemble method
The second observation, that all motif finders win some number of regulons and often perform roughly the same on average, is broadly consistent with a theorem in the Machine Learning field referred to as the No Free Lunch Theorem
In this light, SCOPE can be seen as leveraging the second key observation by embracing the No Free Lunch Theorem: rather than boost average performance by taking the average results of three general purpose algorithms, SCOPE uses highly specialized algorithms and assumes each will perform strongly on some regulons and weakly on others (and that the unified scoring metric can tell the difference). The working hypothesis is, in effect, that the local maxima are
Of course, based on the No Free Lunch Theorem, SCOPE's performance averaged over all theoretically possible datasets will still converge to that of the other motif finding approaches (including random guessing). As the physical properties of transcription factors will inevitably constrain the structure of their binding sites, biologically relevant datasets comprise a subset of the space of all theoretically possible sequences. Our test set of 78 regulons was selected in a blinded manner (for details, see Additional file
These observations are not offered as definitive proof that there are only three classes of motifs; rather, they show that power can be gained by identifying distinct motif classes and combining specialized algorithms with a unified scoring rule. It is possible that more power could be gained by identifying other distinct motif classes and adding algorithms that explicitly search for those classes. For example, zinc-finger transcription factors have been demonstrated to bind three triplets of nucleotides which overlap at their third base positions
SCOPE may also serve as a complementary approach to the HLK method. For example, the parameters of many methods can be set to search for specific classes of motifs (such as bipartite versus non-bipartite motifs). Thus, analogous to the ensemble method described in this paper, one may build a hierarchical ensemble that first searches each motif class by the HLK method using a number of algorithms or random restarts, and then uses the SCOPE method to choose the best result from among the motif classes. One constraint associated with such an approach is the runtime. A second constraint associated with a hierarchical ensemble learning method is the multiplicative increase in the number of parameters associated with it, though this problem may be ameliorated by the use of parameter-free algorithms that employ restricted search spaces.
An important factor to consider when taking the best of multiple runs is the relative size of the search space. Certainly, to maintain statistical validity, some correction must be made for multiple hypothesis testing. Furthermore, the effects of multiple testing may bias the results in favor of one of the component algorithms. To ensure statistical validity and avoid such a bias, we developed a simple Bonferroni-like correction, which penalized every proposed motif in proportion to its length and degree of degeneracy, resulting in a modest improvement of 8% in SCOPE's accuracy.
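The exact form of the correction is not given here, so the sketch below is only a guess at its spirit: score each motif by -log p minus a penalty that grows with motif length and with the degeneracy of its IUPAC letters.

```python
import math

# Number of concrete bases matched by each IUPAC letter.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3, "N": 4,
}

def bonferroni_penalty(motif):
    """Hypothetical Bonferroni-like penalty, proportional to motif
    length and degree of degeneracy (NOT SCOPE's published formula)."""
    # Length term: each position could have been any of 15 IUPAC letters.
    length_term = len(motif) * math.log(15)
    # Degeneracy term: log of the number of concrete sequences matched.
    degeneracy_term = sum(math.log(IUPAC_DEGENERACY[c]) for c in motif)
    return length_term + degeneracy_term

def penalized_score(log_p, motif):
    """-log p minus the penalty; larger values are better."""
    return -log_p - bonferroni_penalty(motif)
```

Under such a scheme, a long or highly degenerate motif must achieve a correspondingly smaller p-value before it can outrank a short, specific one.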
Although our test set of 78 regulons gave us enough power to find significance between SCOPE and its components or other algorithms, it did not provide enough power to disentangle the effects of small improvements (such as the Bonferroni correction, the objective function that takes position bias into account, or scoring motifs based on one or both strands), especially in the rigorous cross-validation framework necessary to decipher precisely which aspects contribute significantly to the performance. Nevertheless, as larger datasets become available, SCOPE's efficient search strategy makes it an ideal platform for exploring the effect of focused improvements to the motif finding approach described, such as the use of complex objective functions that may better approximate biological significance.
The comparisons to other motif finding programs in this study are provided to place SCOPE's performance in the broader context of the motif finding field, particularly when viewed from the standpoint of the practicing "bench" biologist. Any performance comparison must be interpreted with caution, since the results are highly dependent on the dataset used, the conditions of the testing and the metrics used for evaluation. With this in mind, we sought to evaluate a wide representation of motif finders on a large number of regulons using performance metrics consistent with previous studies
We note that all the motif finders tested, including SCOPE, performed poorly on the
Conclusion
Ensemble methods hold the potential for providing improvements in motif finding accuracy without resorting to the use of additional data (such as phylogenetic information or characterization of the domain structure of the transcription factor), which are not always available. Typically, ensemble learning methods are plagued with certain liabilities, such as increased runtimes, logistical complexity and a multiplicity of nuisance parameters, all of which grow with the number of programs run. In the machine learning field, ensemble methods have coexisted for many years with non-ensemble methods, with no clear superiority having been established between the two.
SCOPE serves as a proof-of-concept, demonstrating an efficient and effective approach to ensemble-based motif finding. By dividing the search space into tractable domains, SCOPE mitigates the potential liabilities associated with ensemble methods, resulting in a program that is capable of finding
Methods
Accuracy, Sensitivity and Positive Predictive Value
Each algorithm's accuracy for each regulon was measured via the
This metric therefore takes both false positives and false negatives into account at the level of the individual bases that are actually covered by the motif. As in Hu et al.
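The exact definition is truncated in this copy, so for concreteness the sketch below uses one common nucleotide-level formulation, the performance coefficient nTP/(nTP + nFP + nFN), alongside sensitivity and PPV; the paper's metric may differ in detail.

```python
def nucleotide_stats(true_positions, predicted_positions):
    """Nucleotide-level sensitivity, PPV and a combined accuracy.

    Positions are sets of (sequence_id, base_index) pairs covered by
    true vs. predicted binding sites. `accuracy` here is the
    performance coefficient nTP / (nTP + nFP + nFN), one common
    choice; the paper's exact definition may differ."""
    tp = len(true_positions & predicted_positions)
    fp = len(predicted_positions - true_positions)
    fn = len(true_positions - predicted_positions)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else float("nan")  # undefined with no predictions
    accuracy = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return sensitivity, ppv, accuracy

# Toy example: true sites cover bases 10-19, predictions cover 14-21.
true_pos = {("seq1", i) for i in range(10, 20)}
pred_pos = {("seq1", i) for i in range(14, 22)}
sn, ppv, acc = nucleotide_stats(true_pos, pred_pos)
```

Because both false positives and false negatives appear in the denominator, the combined value is driven down by over-prediction and under-prediction alike.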
Objective functions for Statistical Significance
In line with other motif finders, we have used statistical significance as a surrogate for biological significance. Since the latter cannot be defined without data that obviates the need for computational motif finding, objective functions that approximate biological significance are critical. In this section, we detail the objective functions we used and their effect on SCOPE's performance. For any motif
Overrepresentation
The most common statistical test in motif finding is based on overrepresentation, which can be roughly defined as the probability that a motif
where λ is the expectation of
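The equation above is truncated in this copy; a standard Poisson formulation consistent with the surrounding description (λ the expected number of occurrences under the background model) is sketched below. It is offered as one common convention, not necessarily SCOPE's exact objective.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam): the chance of seeing at least
    the observed number of motif occurrences under the background
    model. Plain summation is fine for the small counts typical of
    upstream regions."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

def overrepresentation_score(observed, expected):
    """-log of the tail probability, so larger means more
    over-represented; clamped to avoid log(0)."""
    return -math.log(max(poisson_tail(observed, expected), 1e-300))
```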
Coverage
A simple modification to the overrepresentation objective function is
Positional bias
Transcription factors often require their binding sites to be located in a restricted range relative to the start of transcription. One well-known example is TBP (TATA-binding protein), which localizes the RNA polymerase complex by binding the TATA-box motif roughly 25 bases upstream of the transcription start site
In the context of motifs, we defined the test sample
Combining overrepresentation and positional bias
Since overrepresentation and KS are independent, the probabilities can simply be multiplied together to yield the probability of randomly sampling a motif with a given degree of overrepresentation and positional bias.
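As a sketch of this combination (with hypothetical helper names): compute the KS statistic of motif positions against the uniform distribution, then multiply the resulting positional-bias probability with the over-representation probability.

```python
import math

def ks_statistic_uniform(positions):
    """One-sample KS statistic of motif positions (each scaled to
    [0, 1] within its upstream region) against the uniform
    distribution, which models the absence of positional bias."""
    xs = sorted(positions)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # Largest gap between the empirical CDF (just below and just
        # above each sample point) and the uniform CDF, which is x.
        d = max(d, abs((i + 1) / n - x), abs(x - i / n))
    return d

def combined_score(p_overrep, p_ks):
    """Multiply the two (assumed independent) probabilities, as the
    overrepresentation-KS objective does, and report -log of the
    product so that larger is better."""
    return -math.log(max(p_overrep * p_ks, 1e-300))
```

In practice the KS statistic would first be converted to a p-value (e.g. via the Kolmogorov distribution) before multiplying; that step is omitted here for brevity.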
Motif orientation
Many transcription factors will bind motifs on either DNA strand. Others, such as the general transcription factor TBP (TATA-binding protein), require a specific orientation and will only function if bound to motifs on a specific DNA strand
Availability and requirements
A userfriendly web server, source code and executables are available at the project website.
• Project name: SCOPE
• Project home page:
• Operating system(s): Platform independent
• Programming language: Java
• Other requirements: Java 1.3.1 or higher
• License: Free for academic use
• Any restrictions to use by non-academics: License required
Authors' contributions
AC proposed the original method, designed the experiments and helped design the web front end. JMC implemented SCOPE, contributed to the methodology, and helped design the experiments and the web front end. AC and JMC drafted the manuscript. RSK managed the performance comparison. RHG conceived the overall outline of the study, provided funding, contributed to the methodology and helped design the web front end. All authors contributed to, read and approved the final manuscript.
Acknowledgements
The authors would like to thank Nelson Rosa Jr., for his help in automating the performance comparison, Kankshita Swaminathan for help with collating regulons, and Charlie DeZiel and Nate Barney for their work on the web front end. This research was supported by a grant to RHG from the National Science Foundation, DBI0445967. JMC was supported by a National Human Genome Research Institute grant, T32 HG00035.