Department of Computer Science, University of Western Ontario, N6A 5B7, London, ON, Canada
Department of Mathematics, Ryerson University, M5B 2K3, Toronto, ON, Canada
Abstract
Background
DNA oligonucleotides are a very useful tool in biology. The best algorithms for designing good DNA oligonucleotides filter out unsuitable regions using a seeding approach. Determining the quality of the seeds is crucial for the performance of these algorithms.
Results
We present a sound framework for evaluating the quality of seeds for oligonucleotide design. The
Conclusions
Our work confirms another application where multiple spaced seeds perform the best. It will be useful in improving the algorithms for oligonucleotide design.
Background
An oligonucleotide is a short DNA or RNA sequence. It is usually designed to hybridize with a unique position in a target sequence. In this way the target sequence can be uniquely identified using the oligonucleotide as a probe. DNA oligonucleotides have many applications such as gene identification, PCR (polymerase chain reaction) amplification, or DNA microarrays.
Many software programs have been written to construct good DNA oligonucleotides, such as ProbeSelect
Seeds were made highly popular by the sequence alignment program BLAST
It is intuitively clear that several seeds, with different distribution of the matches, may detect more similarities. This idea has been used in PatternHunter II
Our goal is to show that multiple spaced seeds perform the best for the task of oligonucleotide design. We shall describe a sound framework to evaluate the quality of various types of seeds for oligonucleotide search. Two aspects are to be considered: accuracy and efficiency. Accuracy is the ability of a seed to distinguish between regions that are similar to a given one and those that are not. Efficiency concerns the speed of this process.
To the best of our knowledge, there is only one study on this problem, due to Chung and Park
We introduce a different approach here and show that the multiple spaced seeds actually provide the best accuracy. The accuracy increases with the number of seeds but this comes at the price of reduced efficiency. It is interesting to notice that spaced seeds are both more accurate and more efficient than contiguous seeds.
Methods
In this section we describe our framework for comparing various types of seeds for oligonucleotide design. We first introduce seeds and describe their working mechanism. We also introduce seed sensitivity and explain the intuitive advantages of multiple seeds.
Seeds
A DNA sequence is seen as a string over the alphabet Σ = {
An example of a hit is shown in Figure
Hit example. An example of a hit:
A hit means there is a chance for an actual similarity. The ability of a seed to detect similarities is called
Multiple spaced seeds
Multiple spaced seeds are sets of seeds. A multiple spaced seed containing
The sensitivity alone is not sufficient to assess the quality of a seed, because we can increase the sensitivity as much as we like simply by decreasing the weight. However, that would cause an increase in the number of random hits. We therefore have a trade-off: decreasing the weight increases both the sensitivity and the number of random hits, whereas increasing the weight decreases both. Weight 11 achieves a good balance and this is why it is used in the above-mentioned programs.
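The weight side of this trade-off can be illustrated with a minimal Python sketch. It assumes the seed notation used later in the text (`1` for a required match, `*` for a don't-care position) and a simple model in which the letters of unrelated sequences are independent and uniformly distributed; the function names are hypothetical and the model is an assumption, not necessarily the one used in the paper.

```python
def seed_weight(seed: str) -> int:
    """Number of required match positions ('1') in a seed such as '11*1**111'."""
    return seed.count("1")


def random_hit_probability(seed: str, alphabet_size: int = 4) -> float:
    """Probability that the seed hits at one fixed pair of positions in two
    unrelated sequences, ASSUMING independent, uniformly distributed letters."""
    return float(alphabet_size) ** -seed_weight(seed)


def expected_random_hits(seed: str, len_a: int, len_b: int) -> float:
    """Expected number of random hits between two unrelated sequences,
    counting every pair of positions where the seed fits in both."""
    placements_a = max(len_a - len(seed) + 1, 0)
    placements_b = max(len_b - len(seed) + 1, 0)
    return placements_a * placements_b * random_hit_probability(seed)


if __name__ == "__main__":
    # Each extra unit of weight divides the expected random hits by 4.
    for seed in ("1" * 10, "1" * 11, "1" * 12):
        print(seed_weight(seed), expected_random_hits(seed, 10_000, 10_000))
```

Under this model each additional match position cuts the expected number of random hits by a factor of four, which is the cost that a lower, more sensitive weight has to pay.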
More precisely, consider a single seed
As an example, in Figure
Seed sensitivity. Sensitivity curves for multiple spaced seeds with 1, 2, 4, 8, and 16 seeds of increasing weights: 11, 12, 13, 14, and 15, respectively. The length of the random region is
One should be aware, however, that a higher number of seeds requires more memory, since more hash tables must be stored, and this imposes an upper bound on the number of seeds that can be used.
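To make the advantage of several seeds concrete, the following Python sketch computes the sensitivity of a set of seeds exactly, by enumerating all match/mismatch patterns of a short random region. It assumes the usual Bernoulli model, in which each position of the region is a match independently with probability `p`; the function names are hypothetical, and the enumeration is exponential in the region length, so it is only meant for toy sizes.

```python
from itertools import product


def hits(seeds, region):
    """True if at least one seed matches somewhere in the region.

    `region` is a sequence of 0/1 values (1 = match); a seed matches at
    `start` if every '1' position of the seed falls on a 1 of the region.
    """
    for seed in seeds:
        for start in range(len(region) - len(seed) + 1):
            if all(region[start + k] == 1
                   for k, c in enumerate(seed) if c == "1"):
                return True
    return False


def sensitivity(seeds, length, p):
    """Exact probability that the seed set hits a random region of the
    given length, ASSUMING independent matches with probability p."""
    total = 0.0
    for region in product((0, 1), repeat=length):
        if hits(seeds, region):
            ones = sum(region)
            total += p ** ones * (1 - p) ** (length - ones)
    return total


if __name__ == "__main__":
    print(sensitivity(["111"], 8, 0.8))
    print(sensitivity(["111", "11*1"], 8, 0.8))  # never lower than one seed
```

Adding a seed to the set can only add patterns that are hit, so the sensitivity of the set is never lower than that of any of its members.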
Accuracy and Efficiency
The oligo design problem requires the ability to construct oligos that will hybridize only at unique positions in a given sequence. That is, for a given sequence (a potential oligo), we need to be able to accurately distinguish sequences that are similar to it from those that are not. Our setup will therefore include precisely constructed sequences of both types which need to be distinguished.
Assume we have a set of sequences, which are divided, as in
Data set example. One main sequence and its associated secondary sequences, marked as oligos or nonoligos. The last column shows which of the secondary sequences are hit by the seed 11*1**111 and the number in parentheses gives the number of hits.
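The hit counts in such a data set can be sketched in Python as follows. The sketch assumes that `1` marks a position where the two sequences must agree and `*` a position that is ignored, and that a hit is counted for every pair of positions (one in each sequence) at which the seed matches; whether the paper counts hits in exactly this way is an assumption, and `count_hits` is a hypothetical name.

```python
def count_hits(seed: str, main: str, secondary: str) -> int:
    """Count the position pairs (i, j) at which `seed` hits, i.e. where
    main[i + k] == secondary[j + k] for every k with seed[k] == '1'."""
    span = len(seed)
    care = [k for k, c in enumerate(seed) if c == "1"]
    count = 0
    for i in range(len(main) - span + 1):
        for j in range(len(secondary) - span + 1):
            if all(main[i + k] == secondary[j + k] for k in care):
                count += 1
    return count


if __name__ == "__main__":
    main_seq, secondary_seq = "ACGT", "ATGT"
    # The mismatch in the middle breaks the contiguous seed, not the spaced one.
    print(count_hits("111", main_seq, secondary_seq))
    print(count_hits("1*1", main_seq, secondary_seq))
```

A secondary sequence is then considered hit by the seed when its count is at least one.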
We define next a measure of the quality of a given seed. A widely used measure for the accuracy of a test is the




The precision
We shall define the
Note that, in binary classification, "recall" is also called "sensitivity." To avoid any confusion, we use the term "sensitivity" only with the meaning of "seed sensitivity" as defined in the "Seeds" subsection above.
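Since the formulas are not reproduced in this text, the following Python sketch uses the standard binary classification definitions of precision and recall, and takes the accuracy measure to be the F-score (their harmonic mean). That last choice is an assumption about the measure meant above, and the function names are hypothetical.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported positives that are true positives."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that are reported."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (ASSUMED accuracy measure)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


if __name__ == "__main__":
    # Illustrative counts only, not taken from the paper's data.
    print(precision(8, 2), recall(8, 2), f_score(8, 2, 2))
```

The harmonic mean is low whenever either precision or recall is low, so a seed cannot score well by hitting everything or by hitting almost nothing.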
In our example in Figure
The approach in
The
In our example in Figure
The efficiency of
Results and Discussion
We compare in this section various types of seeds using the framework constructed above and then discuss the obtained results.
Data sets
Data sets were built using the OligoGenerator program of
 identity level with target sequence: 85%
 maximum stretch of continuous matches: 15 bp
 hybridization free energy: 30 kcal/mol
The difference between our data set and the one of
 identity level with target sequence: 85%
 maximum stretch of continuous matches: 20 bp
 hybridization free energy: 40 kcal/mol
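Two of the three criteria above can be illustrated with a short Python sketch that compares a candidate sequence with its target position by position. It assumes a gapless comparison of equal-length sequences, which may differ from how OligoGenerator evaluates these criteria; the hybridization free energy is not modelled, and the function names are hypothetical.

```python
def identity_level(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree
    (gapless comparison; an ASSUMPTION about how identity is measured)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def max_match_stretch(a: str, b: str) -> int:
    """Length of the longest run of consecutive matching positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


if __name__ == "__main__":
    target, candidate = "ACGTACGT", "ACGTTCGT"
    print(identity_level(target, candidate))     # 7 of 8 positions agree
    print(max_match_stretch(target, candidate))  # longest run of matches
```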
Seeds
Computing optimal spaced seeds is a hard problem; see
Computing optimal multiple spaced seeds is significantly harder than computing optimal single seeds. Even computing an optimal 2seed of usable weight and length is infeasible. Therefore, many heuristic algorithms have been designed to compute multiple spaced seeds but they are all exponential, with the exception of SpEED
Using SpEED, we have computed highly sensitive multiple spaced seeds with 2, 4, 8, and 16 seeds. The parameters used by SpEED for computing the seeds are derived from those of the oligos. That is,
This file contains all the seeds used in our tests. The contiguous, transition, and single spaced seeds are the same as in
Comparison
For each of the two cases, 50mers and 70mers, we have computed the average accuracy and efficiency for all seeds on the data sets generated. The highest accuracy values for each seed type are shown in Table
Highest accuracy values

| seed type | 50mer max. accuracy (mean) | 50mer max. accuracy (stdev.) | 50mer weight | 70mer max. accuracy (mean) | 70mer max. accuracy (stdev.) | 70mer weight |
|---|---|---|---|---|---|---|
| contiguous | 0.8760 | 0.0011 | 10 | 0.8822 | 0.0003 | 11 |
| transition | 0.8856 | 0.0013 | 12 | 0.8985 | 0.0002 | 13 |
| 1seed | 0.8888 | 0.0017 | 12 | 0.9009 | 0.0001 | 13 |
| 2seed | 0.9018 | 0.0019 | 13 | 0.9082 | 0.0006 | 14 |
| 4seed | 0.9051 | 0.0014 | 15 | 0.9138 | 0.0003 | 16 |
| 8seed | 0.9080 | 0.0018 | 16 | 0.9176 | 0.0008 | 17 |
| 16seed | 0.9117 | 0.0013 | 17 | 0.9191 | 0.0006 | 19 |
The highest accuracy for each seed type is given, for the 50mer data sets on the left and the 70mer data sets on the right. Both the mean and the standard deviation are given. The weight of the seed that achieves the highest accuracy in each category is given after the accuracy values.
A complete picture is given in Figures
Precision and recall for 50mer data sets. The left plot shows the precision and the right plot the recall values for the 50mer data sets.
Accuracy for 50mer data sets. The left plot shows the accuracy values for the 50mer data sets. The right plot shows the top part of the curves to emphasize the differences.
Precision and recall for 70mer data sets. The left plot shows the precision and the right plot the recall values for the 70mer data sets.
Accuracy for 70mer data sets. The left plot shows the accuracy values for the 70mer data sets. The right plot shows the top part of the curves to emphasize the differences.
The increased accuracy comes at a price in efficiency. Figure
Efficiency. The efficiency values for the 50mer data sets are shown in the left plot and for 70mers in the right plot.
This file contains the complete results of our tests for all the 50mer data sets.
This file contains the complete results of our tests for all the 70mer data sets.
In the process of designing oligonucleotides, similar regions need to be identified and eliminated in order to keep the unique ones, out of which oligos can be chosen. For this purpose, a very high recall is desired. Therefore, we shall also rank the seeds by setting a lower bound on the recall and then considering only the accuracy of those seeds that satisfy this lower bound. The bounds on the recall are 0.86, 0.87, ..., 0.99. Figure
Accuracy for bounded recall. The accuracy values for bounded recall values are given for the 50mer data sets in the left plot and 70mers in the right one. For each value x on the abscissa, only the accuracy values of seeds with recall at least
Efficiency for bounded recall. The efficiency values for bounded recall values are given for the 50mer data sets in the left plot and 70mers in the right one. For each value
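The ranking by bounded recall described above can be sketched in Python as a filter followed by a maximum. The record layout, the function name, and the numbers in the usage example are all hypothetical and serve only to illustrate the selection step; they are not results from the paper.

```python
def best_under_recall_bound(results, bound):
    """Among seed results whose recall is at least `bound`, return the one
    with the highest accuracy, or None if no seed satisfies the bound.

    `results` is a list of dicts with the (hypothetical) keys
    'seed', 'recall' and 'accuracy'.
    """
    eligible = [r for r in results if r["recall"] >= bound]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r["accuracy"])


if __name__ == "__main__":
    # Made-up values, for illustration only.
    results = [
        {"seed": "A", "recall": 0.90, "accuracy": 0.91},
        {"seed": "B", "recall": 0.97, "accuracy": 0.89},
        {"seed": "C", "recall": 0.99, "accuracy": 0.85},
    ]
    # Recall bounds 0.86, 0.87, ..., 0.99, as in the text.
    for bound in [b / 100 for b in range(86, 100)]:
        best = best_under_recall_bound(results, bound)
        print(bound, best["seed"] if best else None)
```

Raising the bound can only shrink the set of eligible seeds, so the best achievable accuracy is non-increasing in the bound.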
A last comment concerns the transition seeds. A single transition seed is slightly less accurate than a single spaced seed. However, this is not sufficient reason to rule out multiple transition seeds. Our analysis focuses on multiple spaced seeds since we were in a position to compute very good ones. Multiple transition seeds should be investigated further.
Discussion
As explained earlier, accuracy and efficiency cannot be mixed. Taken separately, each gives a clear ranking of the seeds. Together, they give the trade-off: better accuracy comes at a price in efficiency (except when contiguous seeds are replaced by transition-constrained or single spaced seeds).
Conclusions
We have presented a sound framework to compare seeds for oligonucleotide design. It is known that multiple spaced seeds perform better than the other seeds in many applications but the requirements of oligo design are different. We have proved that, also in this application, multiple spaced seeds have the highest accuracy. This corrects the conclusion of Chung and Park
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
LI and SI identified the error in the approach of
Acknowledgements
LI and SI were each supported by a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC).