Abstract
Background
To meet the needs of gene annotation for newly sequenced organisms, optimized spaced seeds can be implemented into crossspecies sequence alignment programs to accurately align gene sequences to the genome of a related species. So far, seed performance has been tested for comparisons between closely related species, such as human and mouse, or on simulated data. As the number and variety of genomes increases, it becomes desirable to identify a small set of universal seeds that perform optimally or nearoptimally on a large range of comparisons.
Results
Using statistical regression methods, we investigate the sensitivity of seeds, in particular good seeds, between four cDNAtogenome comparisons at different evolutionary distances (humandog, humanmouse, humanchicken and humanzebrafish), and identify classes of comparisons that show similar seed behavior and therefore can employ the same seed. In addition, we find that with high confidence good seeds for more distant comparisons perform well on closer comparisons, within 98–99% of the optimal seeds, and thus represent universal good seeds.
Conclusion
We show for the first time that optimal and nearoptimal seeds for distant speciestospecies comparisons are more generally applicable to a wide range of comparisons. This finding will be instrumental in developing practical and userfriendly cDNAtogenome alignment applications, to aid in the annotation of new model organisms.
Background
The next few years are expected to bring a significant increase in the number of available genomes, driven by advances in sequencing technologies [1]. As genome sequencing projects outpace the generation of native mRNA and protein sequences, gene annotation projects for these genomes will need to rely instead on cDNA information from other species. While existing alignment programs align cDNA and the corresponding genomic sequences accurately, they are inadequate for crossspecies comparisons [2]. Beginning with blast [3,4], most alignment programs have used a seedandextend technique to produce local alignments, starting from exact or nearexact word matches (seeds) between the two sequences and extending them to a local alignment in several stages. Blast uses an exact match of 11 contiguous positions, represented by a vector of 1s (11_{c }= 11111111111). Such a seed is called continuous. More recently, spaced seeds have been introduced, which allow wildcard positions in the seed pattern, marked with 0s. For instance, Kent and Zahler [5] used a seed that allowed for mismatches at the wobble codon position (WABA_{12 }= 11011011011011011) to increase the sensitivity of cDNAgenomic sequence alignment. For the same weight, or number of 1 positions, spaced seeds were shown to perform better than continuous seeds in most cases [6].
Recently, Ma et al. [7] introduced a framework for predicting the sensitivity of a spaced seed given a model of alignment, and showed how to determine the theoretical seed sensitivity with an efficient dynamic programming method [8]. Given a Bernoulli model of alignments with the parameter p (i.e., the probability that any one alignment position is a match) estimated from the average sequence identity, they determined optimal seeds for humanmouse genomic sequence alignment. These seeds were later implemented in the programs PatternHunter and blastz [9]. Subsequent studies used increasingly complex alignment models, such as the order 1 inhomogeneous 3periodic Markov models in [6,10] or the higher order (3–5) inhomogeneous Markov models of [11], often in the specialized context of coding sequences. Extensions of the seed models were proposed to further improve sensitivity or extend the range of applicability to nonnucleotide sequences, for instance multiple spaced seeds [12], or vector seeds [13]. In the latter, each position in the seed pattern is a real value representing the weight of a match or substitution at that position in the total match score, and a seed match is declared when the total score exceeds an a priori fixed threshold. However, these methods increase the memory and running time of searches, both of which are critical for practical highthroughput applications.
While early alignment and seed models used the traditional {0, 1} alphabet, to further improve the seed sensitivity transitiononly wildcards were introduced to differentiate between transitions and transversions, first implemented in blastz [9]. Later, Noé and Kucherov [14] formalized the concept and showed that spaced seeds with transitions extend the sensitivity range of {0, 1} seeds, and Zhou and Florea [11] additionally provided a framework for specificity calculation, and showed that they offer better sensitivityspecificity tradeoffs than {0, 1} seeds in practice.
When comparing coding sequences, such as in cDNAtogenome alignments, the codon organization, higherorder position dependencies [15] and specific transitiontransversion biases [16] inherent in the gene sequences are likely to be reflected in the alignment patterns. In [6,10], a threestate Markov model of order 1 is used to simulate the three codon positions, while Brejová et al. [17] introduce two models, the first a threestate Markov model of order 0, and the second a more complex Bernoulli formulation in which each codon is modeled independently. Recently, we proposed to use higher order (3–5) inhomogeneous Markov models with transitions [11] to capture both transitiontransversion biases and sequence compositional patterns.
As the number of sequenced organisms increases, the range of possible pairwise species comparisons will grow quadratically with the number of species. For practical reasons it becomes desirable to identify a small number of seeds that would perform well for a wide range of comparisons, at varying evolutionary distances. Earlier, Choi et al. [1820] determined and analyzed best seeds for genomic sequence comparisons for several sequence identity levels under a Bernoulli model of alignment, in a greatly simplified model of evolutionary distance. At low weights (10–12), seeds performed optimally or nearoptimally across a wide range of sequence similarity levels. At higher weights, however, there was significant fluctuation in seed sensitivity, which led them to the hypothesis that different seeds will be needed for each type of comparison. This simple plevel representation of evolutionary sequence divergence is likely inadequate for coding sequences, which are under a more diverse set of evolutionary pressures. We approach the question of designing universal good seeds for cDNAtogenome comparisons, i.e. seeds that perform well for a large number of comparisons, starting from a complex representation of alignment as a high order 3periodic inhomogeneous Markov model incorporating transitions [11]. Using comparisons between human and four others species (mouse, dog, chicken and zebrafish) to sample a wide range of evolutionary distances, we analyze the distributions of seed sensitivities statistically to characterize and identify universal good seeds. In the remainder of this section, we provide a brief introduction to the problem of designing optimal spaced seeds.
Spaced Seeds
{0,1} spaced seeds
Alignment programs typically use short exact or approximate matches of an a priori specified pattern (seed) to detect local alignments between two sequences. A continuous seed, such as the one used in blast [3,4], requires an exact match of a fixed length k between the sequences, and is represented as a vector of 1s (11_{c }= 11111111111). In contrast, spaced seeds [7] allow for approximate matches by including wildcard positions in the pattern, marked with 0s (e.g., WABA_{12 }= 11011011011011011). The number of 1 positions in the pattern is called the weight of the seed, and the length of the pattern is called the span. Conventionally, seeds must start with a 1 position. In keeping with previous studies, we will use a fixed seed span k = 22 [6,10,11].
An alignment is represented as a string of 0s (mismatches) and 1s (matches) generated from a model , for instance a Bernoulli or a Markov model. A seed S = s_{1 }... s_{k}, s_{1 }= 1, is said to detect the alignment w = w_{1 }... w_{L }∈ {0, 1}^{L }if there is an approximate match for the pattern in the alignment string such that all 1 positions in the seed pattern map to 1 positions in the alignment, i.e. there exists i = 1..L  k + 1 such that w_{i+l1 }= 1 for all l with s_{l }= 1. If such an i exists, the seed is said to occur in the alignment w at position i. The theoretical sensitivity of a seed is then defined as the probability that it will detect a random alignment of length L generated from the model , or equivalently: Sn(S) = P({w ∈ {0, 1}^{L}S detects w}) [6,8]. Traditionally, the alignment length L is 64, determined as the average length of a gapfree alignment in humanmouse comparisons. By definition, an optimal seed is a seed with the highest sensitivity. For a given seed, its sensitivity can be computed exactly using dynamic programming [8,10,21]. Optimal seeds can then be produced by exhaustively searching the seed space [8], while close approximations can be obtained with fast heuristics, such as hillclimbing [6,10] or exploiting the seed structure to reduce the search space [20].
{0, 1, x} spaced seeds for cDNAtogenome alignment
Unlike genomic sequences, gene sequences exhibit higher order dependencies between positions [15], more pronounced transitiontransversion biases [16], and distinct manifestations at the three codon positions, which are likely to translate into alignment patterns. To account for these characteristics, in previous work [11] we proposed a framework that differentiated between transitions and transversions, by using an additional alphabet symbol x, as well as among the three codon positions, using a third order inhomogeneous 3periodic Markov model of cDNAtogenome alignments. In the seed pattern, x marks a position that allows transitions but not transversions, while in the alignment model it simply represents a transition. The weight of the new symbol is 0.5. We denote (n_{1}, n_{0}, n_{x}) the class of seed patterns with n_{1}, n_{0 }and n_{x }symbols of 1, 0 and x, respectively, n_{1 }+ n_{0 }+ n_{x }= k. For a given weight W, there may be multiple (n_{1}, n_{0}, n_{x}) combinations with n_{1 }+ 0.5·n_{x }= W . For instance, for the weight W = 10 and span k = 22, the following combinations are possible: (10, 12, 0), (9, 11, 2), ..., (0, 2, 20).
In the following sections, we investigate the behavior of seeds, and in particular best seeds, among four comparisons or cDNAtogenome alignment models (humanmouse, humandog, humanchicken and humanzebrafish) to obtain a robust characterization of universal good seeds.
Results
The repertoire of species to be sequenced is expected to increase dramatically over the next few years, driven by new, more effective and increasingly reliable sequencing technologies. As the number and phylogenetic diversity of genomes increases, designing optimized seeds for each pair of compared species quickly becomes impractical. It becomes desirable to identify a limited number of seeds that perform well, if not optimally, for a large number of comparisons. We call these universal good seeds. We examine seeds for four comparisons between species with significantly different evolutionary distances and mutation patterns (humanmouse, humandog, humanchicken and humanzebrafish). For simplicity, we will refer to each comparison by the name of the second organism (DOG, MUS, CHK, ZFS). By analyzing the distribution of seed sensitivities between models statistically, we identify strategies for designing universal good seeds.
Universal good seeds
We address two questions: First, are there groups of comparisons, or equivalently alignment models, that produce similar behavior of all seeds? We call such models seedequivalent, and one optimal seed would then satisfy all comparisons in a seedequivalence group. Second, are there universal good seeds, i.e. seeds that are optimal or nearoptimal for a large range of comparisons and evolutionary distances?
To answer these questions, we started by calculating seed sensitivities exhaustively in the four models for three weights, W = 12, 14, 16, for the (12, 10, 0), (14, 8, 0) and (16, 6, 0) combinations. We then compared the distributions of values between any two models using statistical regression methods to determine seed trends, by measuring the 95% confidence interval for a seed when projected via the regression curve, and to identify outliers, or seeds that have significantly different behavior (at > 2.5σ from the regression curve) between the two models. Although linear and nonlinear regression models showed similar goodness of fit, an order 3 nonlinear method was chosen because it provided a more conservative estimate of predicted values among the top scoring seeds, which are particularly relevant to this study. The scatter plot of seed values between the CHK and DOG distributions and the two regression curves, linear and nonlinear, produced with the R statistical analysis package [22] are shown in Figure 1. Scatterplots for all pairwise comparisons are shown in [Additional file 1].
Figure 1. Scatterplot of seed sensitivity values between the CHK and DOG comparisons, for weight W = 16 and for the (16, 6, 0) combination. The linear and nonlinear regression (solid) and the 95% confidence interval (dotted) curves are shown.
Additional File 1. Scatterplots of seed sensitivity values between pairs of comparisons. This collection of figures shows the scatterplots of seed sensitivity values and the fitted regression curves between any two models in (DOG, MUS, CHK, ZFS).
Format: PDF Size: 944KB Download file
This file can be viewed with: Adobe Acrobat Reader
Determining seedequivalent models
For most pairs of models, the 95% confidence interval for the predicted seed values (2·t_{α/2}σ_{y }in Table 1) is less than or close to 0.01, meaning that with high probability the observed seed sensitivity in the second model is within close vicinity (  t_{α/2}σ_{y}, + t_{α/2}σ_{y}) of the predicted value, . When seed values can be so closely predicted between both (X, Y) and (Y, X), the two models X and Y can be deemed seedequivalent. In our case, DOG and MUS, and CHK and ZFS, respectively, are seedequivalent despite the difference in their sensitivity ranges. Interestingly, seed values appear to also be accurately projected from a reference model X describing a more distant evolutionary relationship, to a compared model Y representing a closer comparison, but not conversely. We explore this observation below.
Table 1. Comparison of seed sensitivity distributions between models. Seed population sizes for (12,10,0), (14,8,0), (16,6,0) are 352716, 203490 and 54264, respectively. Explanation of columns: y_{min, max }– minimum and maximum sensitivity in the compared model; x_{max }– maximum sensitivity in the reference model; t_{α/2}σ_{y }– half the length of the 95% prediction interval (t_{α/2 }= 1.96 when α = 0.05); T(x, y) = ((a + bx_{max }+ + )  t_{α/2}σ_{y})/y_{max}, where a, b, c and d are the coefficients of the regression curve, and σ_{y }is the estimated regression standard error of prediction for a given x value. Because the number of values is large, σ_{y }≃ σ for all y. Outliers are determined as those points that satisfy either of the following two criteria: y  > 2.5σ (U) or  y > 2.5σ (L). Here, = a + bx + cx^{2 }+ dx^{3}.
Determining universal seeds
With the same argument as above, when the margin of error t_{α/2}σ_{y }is small, with high confidence highscoring seeds in the reference model are expected to lie at the top of the sensitivity range in the compared model. Hence, good seeds will largely be shared between the models. To measure how closely the predicted value of the optimal seed in the reference model (x_{max}) approaches the optimal (real) seed in the compared model (y_{max}), we used the statistical lower bound of the predicted interval for x_{max}, as a worst case scenario, and compared it with y_{max}. These ratios are shown as T (x, y) in Table 1. As expected, the values are consistently above 0.98 between the seedequivalent models, but also when a seed in a more distant model is projected onto a closer model, and the trends are more pronounced with the larger seed weights. This level of accuracy in predicting nearoptimal seeds by regression projection, albeit statistical, is comparable with that obtained with the heuristic algorithm proposed in [20]. To rule out the effects of outliers among the topscoring candidates, we identified those seeds that fall significantly outside of the 95% confidence range (Table 1, columns 7 and 8). Although roughly 1–2% of seeds are expected to place outside the predicted range, they are scattered along the sensitivity range for the model.
Collectively, these findings suggest that simply selecting top scoring seeds in the most distant model, in our case ZFS, would lead to good seeds for all other models, and therefore gives a simple strategy for determining universal good seeds.
Evaluation
To probe the universality of seeds, we evaluate seeds optimized for one comparison on the three others. For each comparison, we determine best seeds for most weights of practical interest, W = 10..16, for all (n_{1}, n_{0}, n_{x}) combinations. Table 2 lists the optimal seed under our fixed span model (k = 22) for each weight for the four comparisons, and the complete list including all combinations is in [Additional file 2]. Figure 2 shows the sensitivity maxima that can be achieved by seeds for each combination. For each comparison (DOG, MUS, CHK, ZFS) the maximum overall sensitivity declines steadily as the seed weight increases, at a rate roughly proportional to the percentage of matches in the alignment (p_{d }= 89.0%, p_{m }= 85.0%, p_{c }= 75.5% and p_{z }= 68.7%). Within each weight group, the sensitivity maxima vary, sometimes dramatically, between (n_{1}, n_{0}, n_{x}) combinations: seeds with the largest number of transitions consistently score the lowest, and seeds with a small number of transitions, depending on the species (2–4 for CHK and ZFS, 6–8 for MUS and DOG), perform the best. We hypothesize that the optimal number of seed transitions for each comparison is determined by the transitiontransversion ratio (κ) in the alignment model, and that more transition positions are expected in the optimal seeds as the ratio increases. Thus, DOG (κ_{d }= 1.97) and MUS (κ_{m }= 1.73) are the most likely to see the effects of transitions, while the effects on CHK (κ_{c }= 1.17) and ZFS (κ_{z }= 0.95) will be smaller. This ratio also determines the gain in sensitivity that can be obtained by seeds incorporating transitions compared to {0,1} seeds.
Table 2. Seeds optimized for the CHK, DOG, MUS, ZFS comparisons, for weight W = 10..16, using hillclimbing. For large weights (e.g., W ≥ 16), the fixed span k = 22 may significantly constrain the range of seeds, and therefore the seeds produced under this model may not be optimal in practice.
Additional File 2. Optimal seeds for the CHK, DOG, MUS and ZFS comparisons. This table lists the seeds optimized for the DOG, MUS, CHK and ZFS comparisons, for weights W = 10..16 and for all (n_{1}, n_{0}, n_{x}) combinations, obtained using hillclimbing. For larger weights (e.g., W ≥ 16), the fixed span k = 22 may significantly constrain the range of seeds, and therefore the seeds produced under this model may not be optimal in practice.
Format: PDF Size: 48KB Download file
This file can be viewed with: Adobe Acrobat Reader
Figure 2. Sensitivity maxima for DOG, MUS, CHK and ZFS comparisons, for seeds of weight W = 10..18. For each weight W, (n_{1}, n_{0}, n_{x}) combinations are shown righttoleft starting with n_{x }= 0 and subsequently increasing n_{x}. Sensitivity drop rates between consecutive weights are shown at the top of the plots.
To validate the universal seeds, we measured the performance of seeds optimized for one comparison on each of the other three, both theoretically and on real data. For empirical evaluations, we tested the ability of each seed to detect orthologous exons between human and each of the other species. Table 3 lists the theoretical (T) and empirical (E) seed sensitivity values, calculated as described in Methods, averaged over each weight group. Theoretical values show very similar sensitivities for all four groups of seeds on all four comparisons. Empirical values are somewhat more varied, but they also indicate that all seeds perform optimally or nearoptimally for multiple comparisons. In all cases, seeds designed for the more distant comparisons have near optimal performance for the closer ones. In particular, ZFS and CHK optimal seeds are widely applicable, and thus are good candidates for universal good seeds.
Table 3. Theoretical (T) and empirical (E) sensitivities of optimal seeds in the four models. Letters C, D, M, Z indicate the CHK, DOG, MUS and ZFS comparisons, respectively. C_{do }represents sensitivity values for the optimal DOG seeds (do) when applied to the CHK (C) model. Values are averaged within each weight group.
Discussion
While significant effort has gone in designing optimal seeds for comparing human and mouse sequences, a remarkably few other comparisons received any attention at all. With more species being sequenced over the next few years, identifying a small number of seeds that would perform optimally or nearoptimally for a large range of comparisons is becoming essential. For genomic sequences, the sequence identity level has been used as a simple measure of evolutionary distance. Since the sequence composition of genes is significantly more complex, best seeds for comparing coding sequences are expected to depend not only on the sequence identity level, but also on the pattern of mutations, in particular transitiontransversion biases. Intuitively, similar alignment models should produce similar behavior of seeds. Thus, one solution to seed proliferation is to group comparisons that exhibit similar behavior of seeds into seedequivalent classes, such that all comparisons in one class are well served by the same optimal seed. Furthermore, it would be even more desirable to identify one or a small set of seeds suitable for a wide variety of evolutionary distances and mutation patterns, herein called universal seeds.
Our statistical regression analyses of seed sensitivities among four comparisons, chosen at various evolutionary distances, showed that for some pairs of comparisons seed behavior can be predicted closely with highconfidence. In particular, it was possible to identify two sets of models that have relatively different sensitivity ranges but very similar seed behavior, and therefore can be deemed seedequivalent. Moreover, even at larger evolutionary distances more distant comparisons are predictive of closer ones. This conclusion is corroborated by several factors, including the length of the prediction interval, the pattern of outliers, and the ratio of predicted to optimal sensitivity (Table 1). For instance, at 2.5σ regression error, all but a few of the outliers perform better than predicted in the distant model compared to the closer one. In other words, distant comparisons contain more information than closer ones. One outstanding question is how to determine whether two models are close enough to be seedequivalent. While we have not yet found a theoretical formulation, we investigated the relationship between alignment Markov models using a conventional distance measure between their probability distributions. The KullbackLeibler Divergence (KLD) [23] can be applied on the space of alignment words = {0, 1, x}^{64 }to produce a distance between the two models [see Additional file 3]:
Additional File 3. Efficient calculation of the KLD for two Markov models. This appendix describes a recursive procedure for calculating the KullbackLeibler divergence between two Markov models efficiently.
Format: PDF Size: 32KB Download file
This file can be viewed with: Adobe Acrobat Reader
KLD(P, Q) represents the relative entropy of P over Q, or the information gain about when P is used instead of Q. The KLD measure is nonsymmetrical, i.e. KLD(P, Q) ≠ KLD(Q, P), and therefore can capture unidirectional relationships. In Table 4, the more distant comparisons contain consistently more information than closer ones. (Note that we have judged comparisons based on the relative sequence similarity levels, for instance DOG (p_{d }= 89.0%) is closer than MUS (p_{m }= 85.0%).) The table also suggests that a possible cutoff for deciding seedequivalence may be set between 1.0 and 2.0. Additional models and analyses will be needed, however, to validate and fine tune this criterion.
Table 4. Distances between models: KLD is the KullbackLeibler Divergence.
Lastly, one natural question that arises is whether universal seeds exist for other types of comparisons, such as between genomic sequences. Our preliminary experiments using a {0, 1} Bernoulli model of alignment [7,19] and four different sequence similarity levels to represent different evolutionary distances (p = 0.65, 0.75, 0.85 and 0.95) indicate that, again, good seeds for the more distant comparisons will perform well on the closer ones [see Additional file 4]. Moreover, for weight 12, p = 0.65, 0.75, 0.85 form a seedequivalent cluster and the optimal seed is shared among the four models, whereas for the larger weights (14, 16) the seedequivalent groups are sparser and no one seed is optimal for all comparisons. These findings are consistent with the observations in [19]. Thus, although more analyses are needed to test it on various models, our simple strategy for selecting universal good seeds may be more widely applicable to a variety of sequence comparison problems.
Additional File 4. Regression analysis of seed sensitivity distributions for genomic sequence comparison. This table contains the results from the statistical regression analysis of seed sensitivity distributions, similarly to Table 1, but for the case of genomic sequence comparisons. The four models are characterized by the sequence identity levels: p = 0.65, 0.75, 0.85, 0.95.
Format: PDF Size: 18KB Download file
This file can be viewed with: Adobe Acrobat Reader
Conclusion
We performed a statistical analysis of seed sensitivities for four speciestospecies cDNAtogenome comparisons, spanning a wide range of evolutionary distances and mutation patterns, with the goal to determine criteria for selecting a small set of seeds that would perform well on a wide range of comparisons. In particular, grouping models that exhibit similar behavior of seeds into seedequivalence classes could significantly reduce the number of optimal seeds. Most important for practical applications, the analyses showed that with high probability optimal and nearoptimal seeds for the most distant available comparison will translate into good seeds for a wide range of comparisons. These insights, and the sets of optimal seeds predicted for the four comparisons and for a wide array of weights, represent a useful resource in guiding the selection of seeds for developing practical applications.
Availability
Material referenced in the paper can be found in the Additional files below and at our website [24].
Methods
Optimal seeds
Given an alignment model, we calculate seed sensitivity in the {0, 1, x} model recursively as described in [11]. For convenience, we include a summary here.
Let be a homogeneous order 2 Markov model of alignment, and a deterministic finite automoton (DFA) that accepts all and only alignment strings that contain a seed match, built with the AhoCorasick algorithm [25]. Here, denotes the set of states, Σ = {0, 1, x} is the alignment alphabet, is the set of state transitions, and q_{0}, q_{a}, q_{f }are the start, accept and fail states. We write if there is a DFA state transition from state q to state q' on the symbol b ∈ Σ. The sensitivity of the seed Sn(S) then is the probability of all alignment words of length L = 64 accepted by the DFA. We calculate recursively the probabilities P(q, t, α), q ∈ , α ∈ Σ^{3 }that the automaton reaches state q after reading t input symbols ending in suffix α randomly generated from the alignment model. Then, the sensitivity of seed S is :
where P_{0}, P_{1 }and P_{2 }are the marginal probabilities of the order 2 Markov model , and ε is the empty word.
To determine optimal seeds, we use a hillclimbing heuristic [6,11], starting from a random seed and swapping any two distinct symbols with the goal to optimize the score locally. If a 'better' seed is found, it becomes the start seed for variations in the next cycle, until there are no changes to the optimal score. The procedure is applied 20 times for each weight W and for each combination (n_{1}, n_{x}, n_{0}), where n_{i }is the number of i symbols in the seed, satisfying n_{1 }+ n_{0 }+ n_{x }= 22 and n_{1 }+ 0.5·n_{x }= W.
Seed evaluation
We evaluate seed sensitivity both theoretically and empirically to identify good seeds for practical applications. To determine the effects of species divergence on the choice of seeds, we analyze comparisons between human and four other species (dog, mouse, chicken and zebrafish), which coarsely sample the range of vertebrate evolutionary distances.
Theoretical seed evaluation
Given a seed and an alignment model, the theoretical seed sensitivity can be calculated recursively as described above. To train the alignment models for the four sets of comparisons, exons in the human genome were determined from spliced genomic alignments of human mRNA and EST sequences, produced with the programs Sim4/ESTmapper [2,26]. A subset of coding exons and their reading frames were then determined by Fastx [27] matches against the SwissProt database [28]. Lastly, match statistics within the coding exon regions were collected from wholegenome alignments downloaded from the UCSC Genome Browser [29] and used to train the alignment models.
Empirical evaluation
This evaluation component tests the ability of a seed to accurately match dog, mouse, chicken and zebrafish coding exons with their orthologs in the human genome. For this purpose, reference sets of orthologous exons were constructed from homologous RefSeq gene pairs [30,31] as follows. Exon coordinates in the human mRNA and in the human genome were determined from Sim4 [26] alignments of the human mRNA sequence on the human genome HG17 produced with the highthroughput program ESTmapper [2]. These coordinates were then projected onto the mRNA of the other species via the pairwise mRNAmRNA alignment. Starting with roughly 500 gene pairs for each comparison, this procedure produced 2,408 (dog), 4,198 (mouse), 4,869 (chicken) and 4,543 (zebrafish) exon pairs for use in our empirical analysis.
During the evaluation, dog, mouse, chicken and zebrafish coding sequences are searched against the human genome. The search program scans the input sequences, looking for seed matches of length 22 in the human genome. For efficiency, the search uses a bitencoded index of 22 bp words (22mers) in the genome, from which noninformative bits corresponding to 0 or x positions in the seed are removed. Empirical sensitivity is calculated as the fraction of human exons in the reference set that are detected by the seed: Sn^{e }= TP/(TP + FN) [32].
Statistical analysis
We compare the distributions of seed sensitivities between any two comparisons (DOG, MUS, CHK and ZFS) using statistical regression methods. We used a ttest to determine the 95% confidence interval (  t_{α/2}σ_{y}, + t_{α/2}σ_{y}) for the projection = R(x) of a sensitivity value x of a seed from the reference model to the compared model. Here, α = 0.05, t_{α/2 }= 1.96, and σ_{y }is the estimated regression standard error of prediction for a given x value. Because the number of values is large, σ_{y }≃ σ for all y. A nonlinear regression method with an order 3 polynomial was applied to the data: R(x) = (a +bx_{max }+ + ). Lastly, outliers were determined as data points x whose values y in the compared model fall below or above the projected interval: y  > 2.5σ or  y > 2.5σ.
Authors' contributions
LZ performed the analyses and contributed to the drafting of the manuscript. JS provided computational support and critically reviewed the manuscript. LF designed the experiments, directed the project, and drafted the article. All authors have read and approved the final version of the manuscript.
Acknowledgements
Sensitivity calculations were performed on the "Herd" Scientific Computing Cluster at The George Washington University (NSF grant CLS20163A to JS). This work was supported in part by a Sloan Research Fellowship to LF.
References

National Human Genome Research Institute, Listing of Genome Sequencing Proposals [http://www.genome.gov/10002154] webcite

Florea L, Di Francesco V, Miller J, Turner R, Mobarry C, Yao A, Harris M, Walenz B, Dew I, Merkulov G, Charlab R, Deng Z, Istrail S, Li P, Sutton G: Gene and alternative splicing annotation with AIR.
Genome Res 2005, 15(1):5466. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic local alignment search tool.
J Mol Biol 1990, 215(3):403410. PubMed Abstract  Publisher Full Text

Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSIBLAST: A new generation of protein database search programs.
Nucleic Acids Res 1997, 25(17):33893402. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kent W, Zahler A: Conservation, regulation, synteny, and introns in a largescale C. briggsaeC. elegans genomic alignment.
Genome Res 2000, 10(8):11151125. PubMed Abstract  Publisher Full Text

Buhler J, Keich U, Sun Y: Designing seeds for similarity search in genomic DNA.
Proceedings of the Seventh Annual International Conference on Computational Molecular Biology (RECOMB 2003) 2003, 7:6775.

Ma B, Tromp J, Li M: PatternHunter: faster and more sensitive homology search.
Bioinformatics 2002, 18(3):440445. PubMed Abstract  Publisher Full Text

Keich U, Li M, Ma B, Tromp J: On spaced seeds for similarity search.

Schwartz S, Kent W, Smit A, Zhang Z, Baertsch R, Hardison R, Haussler D, Miller W: Humanmouse alignments with BLASTZ.
Genome Res 2003, 13(1):103107. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Buhler J, Keich U, Sun Y: Designing seeds for similarity search in genomic DNA.

Zhou L, Florea L: Designing sensitive and specific spaced seeds for crossspecies mRNAtogenome alignment.
J Comput Biol 2007, 14(2):113130. PubMed Abstract  Publisher Full Text

Sun Y, Buhler J: Designing multiple simultaneous seeds for DNA similarity search.
J Comput Biol 2005, 12(6):847861. PubMed Abstract  Publisher Full Text

Brejová B, Brown D, Vinar T: Vector seeds: an extension to spaced seeds.

Noé L, Kucherov G: Improved hit criteria for DNA local alignment.
BMC Bioinformatics 2004, 5:149. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA.
J Mol Biol 1997, 268(1):7894. PubMed Abstract  Publisher Full Text

Nei M, Kumar S: Molecular evolution and phylogenetics. New York: Oxford University Press; 2000.

Brejová B, Brown D, Vinar T: Optimal spaced seeds for homologous coding regions.
J Bioinform Comput Biol 2004, 1(4):595610. PubMed Abstract  Publisher Full Text

Choi K, Zhang L: Sensitivity analysis and efficient method for identifying optimal spaced seeds.

Choi K, Zeng F, Zhang L: Good spaced seeds for homology search.
Bioinformatics 2004, 20(7):10531059. PubMed Abstract  Publisher Full Text

Preparata F, Zhang L, Choi KP: Quick, practical selection of effective seeds for homology search.
J Comput Biol 2005, 12(9):11371152. PubMed Abstract  Publisher Full Text

Kucherov G, Noé L, Roytberg M: A unifying framework for seed sensitivity and its application to subset seeds.

The R Project for Statistical Computing [http://www.rproject.org] webcite

Cover T, Thomas J: Elements of Information Theory. New York: John Wiley & Sons, Inc; 1991.

George Washington University, Computational Molecular Biology Lab [http://dna.cs.gwu.edu] webcite

Aho A, Corasick M: Efficient string matching: an aid to bibliographic search.

Florea L, Hartzell G, Zhang Z, Rubin G, Miller W: A computer program for aligning a cDNA sequence with a genomic DNA sequence.
Genome Res 1998, 8(9):967974. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pearson W, Lipman D: Improved tools for biological sequence comparison.
Proc Natl Acad Sci USA 1988, 85(8):24442448. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wu C, Apweiler R, Bairoch A, Natale D, Barker W, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin M, Mazumder R, O'Donovan C, Redaschi N, Suzek B: The Universal Protein Resource (UniProt): an expanding universe of protein information.
Nucleic Acids Res 2006, (34 Database):D187191. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

UCSC Genome Browser [http://genome.ucsc.edu] webcite

Pruitt K, Tatusova T, Maglott D: NCBI Reference Sequence (RefSeq): a curated nonredundant sequence database of genomes, transcripts and proteins.
Nucleic Acids Res 2007, (35 Database):D6165. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wheeler D, Barrett T, Benson D, Bryant S, Canese K, Chetvernin V, Church D, DiCuccio M, Edgar R, Federhen S, Geer L, Kapustin Y, Khovayko O, Landsman D, Lipman D, Madden T, Maglott D, Ostell J, Miller V, Pruitt K, Schuler G, Sequeira E, Sherry S, Sirotkin K, Souvorov A, Starchenko G, Tatusov R, Tatusova T, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information.
Nucleic Acids Res 2007, (35 Database):D512. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Burset M, Guigo R: Evaluation of gene structure prediction programs.
Genomics 1996, 34(3):353367. PubMed Abstract  Publisher Full Text