Abstract
Background
Largescale comparison of genomic sequences requires reliable tools for the search of local alignments. Practical local aligners are in general fast, but heuristic, and hence sometimes miss significant matches.
Results
We present here the local pairwise aligner STELLAR that has full sensitivity for εalignments, i.e. guarantees to report all local alignments of a given minimal length and maximal error rate. The aligner is composed of two steps, filtering and verification. We apply the SWIFT algorithm for lossless filtering, and have developed a new verification strategy that we prove to be exact. Our results on simulated and real genomic data confirm and quantify the conjecture that heuristic tools like BLAST or BLAT miss a large percentage of significant local alignments.
Conclusions
STELLAR is very practical and fast on very long sequences which makes it a suitable new tool for finding local alignments between genomic sequences under the edit distance model. Binaries are freely available for Linux, Windows, and Mac OS X at http://www.seqan.de/projects/stellar webcite. The source code is freely distributed with the SeqAn C++ library version 1.3 and later at http://www.seqan.de webcite.
Introduction
Computing good local alignments is a fundamental problem in bioinformatics. By looking for local alignments of biological sequences one aims for example at identifying homologous regions, i.e. regions that are assumed to originate from the same ancestral sequence, or at finding functionally similar sequences. The problem has been studied for more than 30 years [1,2], but still remains interesting. In the beginning, local alignments were used to look for homologous regions in relatively short protein or nucleic acid sequences. Also, for a long time, local alignments have been used to identify conserved, functionally related elements. More recently, local alignments were applied on a genomic scale as prerequisite to global genomic alignments. For several reasons genomic scale alignments are usually not collinear and hence one has to resort to computing local similarities. Now the aim is not anymore to identify some homologous regions but rather to display all similarities between two or more genomic sequences [37]. This requires not only efficiency in computation to process very long sequences, but also accuracy regarding sensitivity, i.e. exact tools that do not miss regions of significant local similarity.
For the computation of local alignments numerous tools have been developed: Early tools such as SSEARCH [2] and FASTA [16] are sensitive but too slow for largescale sequence comparison. Then, there are efficient heuristics, with the BLAST family [1719] being the most prominent example. Further developments for specific largescale analyzes resulted in tools like BLAT [20] which was designed for high speed, and BLASTZ [21] which was designed for higher sensitivity. The more recently published tool BWTSW [22] again focuses on sensitivity and is able to report all local alignments.
To assess homology of biological sequences by local alignment, generally some kind of similarity criterion is necessary. The most widely accepted criterion is the Evalue[18,23], a probabilistic measure that assesses the significance of a local alignment. The Evalue denotes the expected number of local alignments with a minimal score occurring by chance in the input sequences. The score underlying the Evalue is a SmithWatermanlike score, most commonly using affine gap costs. All popular alignment tools that report Evalues, e.g. BLAST, compute such a score, and afterward apply an Evalue threshold on their output. In this paper, we describe a method that follows an alternative approach to compute only significant local alignments of nucleic acid sequences: We compute only highscoring alignments, which are guaranteed to have a good (low) Evalue. We use a maximal error rate for local alignments (normalized edit distance) as a score threshold, and additionally require a minimal alignment length. Our method is specialized on relatively low error rates, which in turn justifies the use of edit distance instead of affine gap costs. For a given error rate and a minimal alignment length our method is exact, i.e. it identifies all local alignments without loss of sensitivity. We compute an Evalue for all generated local alignments and a minimal Evalue for the input parameters. The method is implemented in the program STELLAR (SwifT Exact LocaL AligneR) using the SeqAn C++ library [24,25]. The program depends only on very few and clearly understandable parameters. We prove that our algorithm is exact for all reasonable parameter settings and confirm this experimentally. We compare STELLAR against popular local alignment programs, namely BLAST, LASTZ, BLAT, and BWTSW in terms of speed and sensitivity and show that some of the tools miss many significant local alignments that can be detected with STELLAR.
Methods
Definitions and overview of algorithm
A pairwise local alignment of length n is a sequence of n match, insertion, deletion, and substitution columns. In our approach deletion, insertion, and substitution columns are all treated equally. Hence, we will call these columns error columns. The number of error columns is the edit or Levenshtein distance of a local alignment. Normalizing this distance by dividing it by the length of the local alignment, we obtain an error rate. An εmatch is a local alignment that has an error rate of at most ε > 0 and length n ≥ n_{0}. Fig. 1 shows two examples of εmatches as segments of a longer local alignment.
Figure 1. Overlapping εmatches. An alignment of two strings containing two overlapping εmatches for ε = 0.1 and n_{0} = 20. The εmatch indicated by the dashed box has an error rate of 2/22, the εmatch indicated by the box with a continuous line one of 2/25. The union of these two εmatches from position 4 to position 31 is not an εmatch: the error rate is 3/28 > 0.1. Furthermore, the intersection of the two εmatches is with 19 columns too short to be an εmatch.
The notion of an X – drop to delineate local alignments from each other is well established by Miller and coworkers in the context of similarity alignments [18,26,27]. An Xdrop within an alignment, where X > 0 is given, is a region of consecutive columns with a total score of –X or less. In other words, it is a region where the score drops by X or more. In Fig. 2 we display an example in which we score error columns by –1. The Xdrop is a very intuitive way to model local dissimilarity and hence we choose to adopt the concept for our model of local similarity. Since we address the computation of εmatches in this paper we propose the following scoring scheme that depends on the error rate ε: We score a match by +1 and an error by p = –1/ε + 1. In addition, we adjust the score dropoff parameter by multiplying it with the negative of the error penalty –p such that the user specified parameter X still corresponds to the number of errors in a matchfree Xdrop region and is easy to grasp for the user (Fig. 3). To emphasize the difference from the usual understanding of an Xdrop we call this weighted Xdrop an εXdrop.
Figure 2. Xdrop. An alignment containing an Xdrop for X = 3. In this example, an εmatch with ε = 0.1 and n_{0} = 20 (indicated by the box) spans the Xdrop.
Figure 3. εXdrop. An alignment containing an εXdrop for X = 3 and ε = 0.1. To reach an εXdrop with a score dropoff of at least , a fourth error column is necessary in this example because of the positively scoring matches in between the errors.
The goal of this work is to find εmatches of two sequences without εXdrops. Often, εmatches overlap to a large extent (see Fig. 1 for an example), or segments of εmatches are themselves εmatches (e.g. we obtain an εmatch by removing a column from one end of the εmatches in Fig. 1). Thus, the number of εmatches can be very large with very redundant similarity information. We handle this issue by only identifying the longest εmatch of one location: If two εmatches overlap, we output only the longer one unless the overhanging part of the shorter εmatch still has a minimal length of n_{0}. In that case we output both complete εmatches. Thus, for the example in Fig. 1 we only output one εmatch. We say that such an εmatch is maximal.
We now present an algorithm to compute exactly all maximal εmatches without εXdrop. Our algorithm runs in two phases: filtering and verification. The filtering phase implements the SWIFT algorithm [28], a very efficient fullsensitivity filter for εmatches. Note that SWIFT is a filter algorithm and does not output a list of εmatches. While it guarantees not to miss any maximal εmatch and hugely reduces the search space, a verification phase is necessary to identify false positive hits of the filter. Furthermore, verification is needed to determine the exact start and end positions of maximal εmatches. We have developed a verification strategy that runs in five steps: εcore identification, εXdrop core filter, εXdrop extension, identification of maximal εmatches, and filtering of overlapping matches. Verification may stop after any of these steps if it is clear that we will not identify a new εmatch without εXdrop. The strategy guarantees to find all maximal εmatches without εXdrop.
A similar twostep algorithm that consists of SWIFT filtering and subsequent verification is implemented in the read mapper RazerS [14]. The difference to RazerS is, however, that we are looking for εmatches that are local in both sequences whereas read mappers compute semiglobal alignments, i.e align the full read sequence to a reference. RazerS uses a slightly modified SWIFT filtering, and the verification is much simpler since the length of the final alignments is preset by the read length.
Filtering phase
The SWIFT algorithm proposed by Rasmussen et al. is an efficient qgram based filter to detect potential εmatch regions between two sequences. It is based on the qgram lemma [29,30]. This lemma states that every alignment of length n with k error columns contains at least T(n, k, q) := n + 1 – q(k + 1) qhits, substrings of q consecutive match columns. Considering the dotplot of two sequences, every qhit corresponds to a diagonal stretch of matches with length q. Obviously, all qhits of an alignment with k errors can cover at most k + 1 different diagonals in the dotplot.
Rasmussen et al. proved that for any given ε and n_{0} there exist w, q, e and τ such that every εmatch contains τ qhits that reside in a w × e parallelogram. A w × e parallelogram is the intersection of e + 1 consecutive diagonals and w + 1 consecutive columns in the dotplot.
To detect w × e parallelograms with τ qhits in the dotplot, the SWIFT algorithm slides from left to right over one sequence and searches overlapping qgrams in a qgram index of the other sequence. Found qhits are counted in bins of Δ + e consecutive diagonals whose first diagonal is a multiple of Δ. As adjacent bins share e diagonals, every w × e parallelogram is contained in one bin. Every bin contains a qhit counter and represents the parallelogram with columns bounded by the leftmost and rightmost contained qhit. If a qhit is found that is at most w – q columns apart from the rightmost qhit, the parallelogram is extended. Otherwise it is closed and a new one starting at the found qhit is opened as the two hits cannot be part of the same w × e parallelogram. A closed parallelogram whose bin counter has reached τ is output as a SWIFT hit and verified as described in the following section.
Fig. 4 shows examples for SWIFT hits containing either a subalignment of an εmatch, whole εmatches, no εmatch or an εmatch with an Xdrop.
Figure 4. Example SWIFT hits. Example SWIFT hits for n_{0} = 20, ε = 0.1, and q = 6 (= s^{min}). Accordingly, w = 20, τ = 3, and e = 2. SWIFT searches parallelograms that contain at least τ = 3 qgram hits by streaming over sequence 1 and searching common qgrams in sequence 2. Subfigure (a) shows an εmatch that results in two SWIFT hits and the εmatch is longer than both of the two hits. (b) shows a SWIFT hit that contains two εmatches and (c) shows a false positive SWIFT hit induced by three separated qgram hits. (d) shows a SWIFT hit that contains an εmatch with a 3drop.
Verification phase
Fig. 4 demonstrates that the output of the filtering phase is not yet a list of εmatches, although the SWIFT algorithm hugely reduces the search space. SWIFT hits may contain one or more εmatches, but may as well be false positive and contain no εmatch. Some SWIFT hits may overlap and contain the same or parts of the same εmatch. Further, they may be much longer than a contained εmatch, or they may cover only parts of an εmatch. Therefore, we have developed the following verification strategy.
We start verifying SWIFT hits by identifying a segment of an εmatch that overlaps with the SWIFT hit. We call such a segment an εcore. We guarantee not to miss any εmatch by identifying all εcores contained in a SWIFT hit. εcores will then serve as starting points for extension, possibly beyond the ends of SWIFT hits. Finally, we cut the longest εmatches from extended εcores and remove overlapping εmatches.
Definition and existence of εcores
Under the simple scoring scheme where a match scores +1 and an error p = –1/ε + 1, we define an εcore of an εmatch as a segment with a score of at least:
where n_{0} is the minimal length of an εmatch and is the next larger length that allows one more error than n_{0}.
In the following two lemmata we prove the correctness of our approach that starts verification from εcores.
Lemma 1. Every εmαtch contains at least one εcore.
Proof. is the maximal number of errors in an εmatch of length n. Thus, the number of matching positions in an εmatch of length n is at least . These matching positions can be split by the errors of the εmatch into at most errorfree segments. If the errors are spread evenly over the εmatch, at least one of the errorfree segments has length ≥ l(n, ε), where . Some thought reveals that any other distribution of the errors would result in at least one longer errorfree segment. Therefore, an εmatch of length n contains at least one εcore of length ≥ l(n, ε). Unfortunately, l(n, ε) is a sawtooth function, i.e. l(n, ε) is not monotonically increasing in n (Fig. 5). Hence, one cannot use l(n_{0}, ε) as a bound for all n ≥ n_{0}. The function drops to a minimum at each point . However, it is easy to confirm that the minima are strictly increasing, i.e. each successive minimum of the sawtooth is higher than the previous one. Therefore, the smallest value of l(n, ε) over all n ≥ n_{0} is the minimum of l(n_{0}, ε) and l(n_{1}, ε). This is exactly the definition of s^{min}, i.e. an εmatch contains at least one errorfree segment of length s^{min} which is an εcore. □
Figure 5. Sawtoothed function. Plot of the sawtoothed function for n_{0} and with ε = 0.05.
Lemma 1 settles the existence of an εcore for every εmatch. Unfortunately, the SWIFT filter only guarantees to report SWIFT hits that overlap a part of each εmatch. Hence, in principle, the SWIFT hit could not contain an εcore, which in turn could make our algorithm miss the εmatch. In the next lemma we will show that for a certain value of the parameter q this is never the case.
Lemma 2. The intersection of a SWIFT hit with an εmatch contains at least one εcore if q := s^{min}.
Proof. By definition, the intersection of a SWIFT hit with an εmatch contains at least τ qgrams interspersed by at most e errors. Therefore, every SWIFT hit contains at least one segment of at least consecutive qgrams. The length of this segment is at least . Because τ > 0 and e ≥ 0 the first summand is greater or equal one, so we obtain . Thus, if we set q := s^{min} every SWIFT hit contains at least one segment of length s^{min}, which is an εcore. □
Step 1: εcore identification
In our verification strategy, we identify εcores by applying a banded version of the WatermanEggert local alignment algorithm [31]. The original algorithm computes all nonoverlapping local alignments that reach a specified minimal score under a certain scoring scheme by dynamic programming (DP). We use the scoring scheme that scores matches by +1 and errors by p = –1/ε + 1 and set the minimal score to s^{min}. In our version, we reduce running time and space requirements of the algorithm by banding the computation of the DP matrix according to the parallelogram shape of SWIFT hits (see Fig. 4, only the shaded parts of the alignment matrix need to be computed). Thus, per εcore there is a maximum running time of where w′ is the length and e the width of the corresponding SWIFT hit.
Since the WatermanEggert algorithm reports only nonoverlapping local alignments one may think that some εcores will not be identified because they are hidden by longer local alignments that reach beyond the end of an εmatch. However, this can only be for nonmaximal εmatches since we have chosen the scoring parameters such that εcores extended by additional local alignment columns only have higher scores if the additional columns have themselves an error rate of at most ε (Fig. 6). This is why all maximal εmatches will also include the additional columns, i.e. no local alignment will reach beyond the end of an εmatch.
Figure 6. Extended εcore. An εmatch extended by one error column and match columns, is still an εmatch. Therefore, an εcore should include such an extension, since all maximal εmatches at this location will include it. In this example where ε = 0.1 and n_{0} = 20 we see an errorfree segment of length 7 ≥ 6 = s^{min} that can be extended to an εcore of length 19 including one error.
Step 2: εXdrop core filter
The second step of our verification strategy is a filter for εXdrops in the εcores. In the previous step we ignored εXdrops in εcores. If now one of our εcores contains an εXdrop, the εcore should be divided into two cores in order to remove the εXdrop. Similarly, if an εcore contains more than one εXdrop, the εcore should be divided into even more cores.
For this decomposition of the εcores, we apply the postprocessing algorithm from Zhang et al. [27] with the same scores and penalties for matching and error positions as in our εcore identification step. The worstcase running time of this algorithm is linear in the length of the εcore.
Possibly, we obtain more than one εcore after this step, but in any case the following extension step has to be conducted only to the left and right of the original nondecomposed εcore. If we started with an εcore with more than one εXdrop, we can skip the following extension step for the middle parts, since the extension algorithm would run immediately into the previously detected εXdrops.
Step 3: εXdrop extension
The goal of the extension step is to obtain a region that spans all εmatches without εXdrop containing the εcore. In this region, the extended εcore, we can then look for the maximal εmatch. Clearly, we can discard extended εcores that are shorter than n_{0}.
For extension we apply the gapped extension algorithm by Zhang et al. [19] with the εadjusted scoring parameters as before. This algorithm is a scoreonly algorithm, i.e. it reports only the score and the sequence positions of the maximal extension but not the precise alignment. However, it reports the maximal and minimal diagonal of the alignment matrix (a band) that needs to be computed when looking for the precise alignment. We will determine the alignment in the next step of the verification strategy together with the exact begin and end positions of the maximal εmatch.
It is hard to do an informative running time analysis for this step. In theory, the dynamic programming algorithm could fill big parts of the alignment matrix. However, for very similar sequences only a narrow diagonal stretch will be filled, and for very distinct sequences we will soon reach an Xdrop and stop. Still, we can estimate the running time by where L is the length of the extension and b is the width of the band.
It is easy to confirm that if an εcore is part of an εmatch without εXdrop, then the extended εcore spans this εmatch without εXdrop.
Step 4: identification of maximal εmatches
The remaining task is to determine the longest alignment in the extended εcore that has an error rate of at most ε. More precisely, we are looking for the longest extension to the left and to the right of the εcore such that the complete alignment has an error rate of at most ε. The maximal error rate (i.e. the number of errors and length) that we can allow for the extension to the left depends not only on the error rate of the εcore but also on the error rate of the extension to the right, and vice versa. Therefore, we cannot determine the lengths of the right and left extension separately. Furthermore, depending on the length of the extension the optimal trace through the alignment matrix can differ (Fig. 7). Our suggestion is to compute for all possible extension lengths the optimal end position of a trace in the alignment matrix, then to determine the optimal lengths of the right and left extension, and lastly to carry out the traceback. The details of these three steps are described in the following.
Figure 7. Extension traces of an εcore. Alignment matrix with two possible traces of the extension of an εcore. Depending on the length of the extension (red numbers), we obtain the lowest error rates by aligning different sequence positions to each other. For an extension length of 5, 7, and 9 it is better to follow the upper trace, whereas for a length of 11 the lower trace has fewer errors. For all other lengths there is a longer trace with the same number of errors, and therefore it is not necessary to consider those lengths while looking for maximal εmatches.
The computation of the optimal end position of a trace for all possible extension lengths can be done along with the computation of the alignment matrix. Unfortunately, two traces of different lengths may end in the same column of the alignment matrix. For this reason, it is necessary to compute the alignment matrix using an algorithm for normalized alignment score [32,33], which in addition iterates over all alignment lengths. As already mentioned, the alignment matrix can be banded by the minimal and maximal diagonal from the seed extension algorithm.
Along with the alignment matrix computation we check and if necessary update in each iteration step a list bestEnds with the best alignment matrix entry for the corresponding extension length. Each entry consists of the length and score (or number of errors) of the alignment, and coordinates in the alignment matrix. This list can afterward be reduced to a smaller set of possible lengths using the following observation: The position before the start and the position behind the end of the sought εmatch is an error, otherwise the εmatch would not be maximal. Hence, we only need to keep those list entries for lengths l where the score difference of bestEnd[l] and bestEnd[l + 1] is smaller than the score of a match. As a result we obtain two lists with possible traceback starting points, one for start positions of the εmatch obtained from the left extension, and one for end positions obtained from the right extension.
On these lists we then apply the following exhaustive search algorithm that iterates over combinations of possible start and end positions: We start with the leftmost possible start position and iterate over possible end positions from right to left. We continue with the next possible start position and restart with the rightmost possible end position as soon as the segment between the current start and end position has an error rate of at most ε (update currently longest εmatch), or if this segment is shorter than the minimal εmatch length or our currently longest εmatch (do not update currently longest εmatch). The algorithm stops when the segment between the current start position and the rightmost possible end position is shorter than the minimal εmatch length or the currently longest εmatch. Using this strategy we cannot miss the longest εmatch without εXdrop if the εcore is part of one.
In case there is another maximal εmatch containing the εcore, we have to recurse this search twice with the lists reduced by the following entries: All start positions before the start position of the longest εmatch and all end positions that are smaller than n_{0} added to the end position of the longest εmatch; and all start positions before the start position of the longest εmatch minus n_{0} and all end positions behind the end position of the longest εmatch.
As a last step, we have to look up the coordinates for the optimal extensions in the bestEnd lists and start traceback from these positions in the alignment matrices. The εcore extended by the resulting alignments is a maximal εmatch that contains the εcore.
The running time for computing the alignment matrix using a normalized alignment score is in where b is again the width of the band and L is the length of the extension. Dropping the normalization of the alignment score reduces running time by a factor of L. Determining the optimal start end end position in the left extension of length L_{1} and right extension of length L_{2} takes time per εmatch, and finally, the traceback takes time linear in the length of the final εmatch.
Step 5: removal of largely overlapping εmatches
An εmatch identified during the verification phase is the longest that contains one specific εcore but it is not necessarily maximal in that it does not overlap with a longer εmatch. In addition, an εmatch containing two εcores will be identified twice. To ensure that we output each εmatch only once and also only maximal εmatches, this last step is necessary.
We remove overlapping εmatches by sorting the εmatches by their begin position in one sequence and pairwisely comparing here overlapping matches further. If two εmatches are found to be identical, one is discarded. Also, if the shorter of the two εmatches has no unique part of length n_{0}, this εmatch is discarded. The running time of this last step is dominated by sorting the εmatches, i.e. it is in , where M is the number of εmatches before removal.
Theorem.Let M be the set of maximal εmatches without εXdrop between two sequences. Then the algorithm that uses SWIFT for filtering and the described strategy for verification will detect exactly all εmatches in M.
Proof. The SWIFT filter algorithm guarantees to report at least one overlapping SWIFT hit for every εmatch of the input sequences. The first step of the verification strategy detects all εcores in SWIFT hits. Apply Lemma 1 to prove that every εmatch contains an εcore. According to Lemma 2, one of the εcores of every εmatch is contained in a SWIFT hit. Thus, for every εmatch in M the first verification step identifies at least one εcore.
Let C′ be the set of εcores identified during the first step, and let C ⊆ C′ be the subset of εcores that are part of an εmatch in M. Since none of the εmatches in M contain an εXdrop, the local alignment obtained after εXdrop extension (Step 3) of an εcore c ∈ C spans the corresponding εmatch.
By cutting the extended εcores as described in Step 4, we eventually end up with a set of εmatches M′ that each contains a certain εcore. Step 4 also guarantees that per εcore no longer εmatch exists than the εmatch in M′. Therefore, after removal of overlapping εmatches (Step 5), our set of εmatches contains exactly all maximal εmatches without εXdrop. □
Results and discussion
We have implemented the algorithmic pipeline in the program STELLAR following exactly the above described steps with one exception: To improve running time, STELLAR computes the alignment matrix during the identification of the longest εmatch with unnormalized alignment score. The following results show that this has in practice no effect on the sensitivity.
We tested STELLAR on simulated and on real genomic data and compared its performance to BLAST [18], LASTZ as the replacement of BLASTZ [21], BLAT [20], BWTSW [22], and SmithWaterman alignments obtained with SSEARCH from the FASTA package [16]. In addition, we ran BLAST with a more sensitive parameter setting: According to Lemma 1, every εmatch contains a seed of length s^{min}, and therefore, we set BLAST’s wordsize parameter to the corresponding s^{min} computed in STELLAR. To demonstrate the differences between the programs we compared speed and sensitivity on all data sets. Running times were measured on a 2.66 GHz Intel Xeon X5550 with 72 GB of RAM running Linux. Running times of BWTSW include preprocessing of the database sequence. In all test runs STELLAR needed less than 1 GB of RAM, so we omit further details of memory usage. As a measurement for sensitivity we computed the percentage of matches from a reference set that were sufficiently covered by matches from the respective program. We say that matches that are covered by less than 10% are missed (Fig. 8). This is a very loose criterion which is in favor of the compared programs.
Figure 8. Match coverage. A set of matches A = {A_{1}, …, A_{8}} that are to be compared to a set of reference matches {B_{1}, …, B_{7}}. We say that a match B_{i} is covered by the matches from A if at least 10% of the alignment columns agree between B_{i} and any match from A.
Simulated sequences
To demonstrate STELLAR’s gain of sensitivity in comparison to other programs, we used simulated data sets. The advantage here is that the reference set of local alignments for the computation of the sensitivity is given. We simulated random sequences with uniformly distributed characters from the alphabet {A, C, G, T}. In addition, we simulated random local alignments of length 50200 bp and inserted errors at different rates into the alignments. In order to see at what error rates the programs start missing local alignments, we created a first data set where 500 such local alignments with an error rate of 0%, 2.5%, 5%, 7.5%, or 10% were inserted into a sequence of length 1 Mb at random positions. A second simulation was conducted to assess the effect of the sequence length. Here, sequences of lengths 1 kb, 10 kb, 100 kb, 1 Mb, and 10 Mb were simulated containing local alignments with error rates between 0 and 10%. On these data sets we ran the above mentioned programs, with STELLAR’s error rate parameter set accordingly and the minimal match length set to 50. The results are shown in Tables 1 and 2. Table 1 demonstrates that STELLAR, in particular for the higher error rates, outperforms the other programs. SSEARCH has full sensitivity as expected but is very slow. We confirmed that with an Evalue cutoff of 0.01 it does not detect any other than the inserted local alignments. BWTSW reports all local alignments, too, but is still much slower than STELLAR. For BLAST and LASTZ the number of missed matches is low for very low error rates but increases with higher error rates. This implies that one can benefit the most from STELLAR when comparing closely related sequences that still have significant differences. As an example, Fig. 9 displays one εmatch that only STELLAR, BWTSW, and SSEARCH identify. BLAST is the fastest of all programs, though only with default parameters and lower sensitivity. BLAT constantly misses around 30% of all matches. We assume that the reason for BLAT’s bad performance is that it was originally designed for the comparison of many short sequences (ESTs or reads) against one long sequence and not for the comparison of two long sequences. Table 2 supports this assumption, as the number of matches missed by BLAT is low for the 1 kb – 100 kb sequences but increases up to almost 70% for sequences of length 10 Mb. In contrast, the sensitivity of BLAST and LASTZ seems not to be affected by different sequence lengths. STELLAR is in general faster than BWTSW, BLAT, and LASTZ with one exception – the 10 Mb sequences. This indicates already a limitation of STELLAR on very long sequences with high error rates. The SWIFT filter has a lower specificity for high error rates and generates very many SWIFT hits. As a result, many verification steps are necessary, which leads to an increase in running time. However, STELLAR is faster than the sensitive BLAST for the 10 Mb sequences and also for high error rates. The sudden increase in running time for the sensitive BLAST at higher error rates (Table 1) is due to a much lower s^{min} for ε = 10%.
Table 1. Running times and sensitivity on simulated sequences containing local alignments of different error rates
Table 2. Running times and sensitivity on simulated sequences of different lengths
Figure 9. εmatch in simulated data. An example εmatch from the simulated sequences of length 1 Mb and ε = 10%. This εmatch with an Evalue of 7 × 10^{–26} is only found by STELLAR, BWTSW, and SSEARCH, but by none of the other programs.
Genomic sequences
We downloaded the assembled genomes of Drosophíla melanogaster (release 5.26) and Drosophíla pseudoobscura (release 2.14) from FlyBase [34]. We selected chromosome arm 2L from D. melanogaster (~23.5 Mb) and group 3 from the chromosome 4 assembly of D. pseudoobscura (~11.7 Mb) for our test runs. Unfortunately, this data set is too big to compute local alignments with SSEARCH, and since BLAST performs better than BLAT and LASTZ on the simulated data we chose to compare STELLAR on the genomic data only to BLAST. We expect BLAST to compute very short alignments with low error rates and some long alignments with higher error rates that do not fulfill the minimal length or error rate criterion for εmatches, and hence STELLAR will not find them. Therefore, to doublecheck STELLAR’s sensitivity, we filter all εmatches from the set of BLAST hits. Additonally, some of the longer BLAST hits may contain valid εmatches that we extract and add to the set of filtered BLAST hits.
Results for STELLAR are shown in Table 3 and results for BLAST in Table 4. STELLAR identifies for example 345 εmatches with an error rate of 10% and a minimal match length of 200. As expected, these cover all of the εmatches filtered from the set of BLAST hits. In contrast to the simulated sequences, the running time increases significantly with a higher error rate or lower minimal length. This can be explained with the filter algorithm being more specific for higher minimal lengths and low error rates as already mentioned above. With less specific filtering, many more SWIFT hits need to be verified resulting in a higher running time. In the future we might be able to reduce this effect by parallelization.
Table 3. Results of STELLAR on drosophila chromosomes
Table 4. Results of BLAST on drosophila chromosomes
BLAST with default settings is again the faster program, but misses 17 εmatches of minimal length 200. One of these matches is displayed in Fig. 10. When we change the word size parameter such that BLAST is able to identify all εmatches of minimal length 200, it is slower than STELLAR. STELLAR with minimal length 100 and error rate 10% is slowest in all tested settings, but identifies 408 εmatches that BLAST with default parameters does not find, and 13 εmatches that even the more sensitive setting of BLAST does not find though these εmatches have an Evalue of 6.1 x 10^{–23} or lower.
Figure 10. εmatch between drosophila sequences. An example εmatch from the drosophila sequences with ε = 10% and n_{0} = 200. This εmatch with an Evalue of 6 × 10^{–84} is not found by BLAST with default parameters.
Conclusions
We presented STELLAR, an algorithm to compute all local alignments of a minimal length according to a clear quality definition using the established measures error rate and Xdrop. STELLAR brings exact local alignments to the community at the speed of heuristic stateoftheart tools like BLAST, BLAT, or LASTZ. In addition, our experiments show that our effort is worthwhile since the heuristic tools miss up to about a third of the matches using simulated and real genomic data. Compared to its closest competitor, BWTSW, it is in most benchmarks faster and offers with the Xdrop parameter a possibility to exclude local alignments with bad regions. A limitation of STELLAR is that only εmatches up to a certain error rate can be computed since the filtering phase loses specificity with increasing error rate. Therefore, for longer and less similar though significant local alignments BLAST remains more appropriate.
As an outlook another relatively new application for local alignments that has emerged with the advent of cheap nextgeneration sequencing should be mentioned. Standard read mapping programs [814] usually can only map entire reads to the reference. With the increasing read length, there will be more reads that span breakage points, e.g. translocations, gene fusions, or splice junctions. The application of an efficient and exact local alignment program could be one way to successfully map such reads and detect variation [15]. Application of STELLAR is especially promising in that it uses the error rate for sensitivity control, an established criterion for read mappers. With a downstream chaining procedure of the partial read matches detected by STELLAR, it may then be possible to detect even multiple splits of reads. Hence, for finding local alignments in the tested range of error rates STELLAR could replace the heuristic tools.
Competing interests
The authors declare that they have no competing interests.
Acknowledgements
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 9, 2011: Proceedings of the Ninth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/12?issue=S9.
References

Sellers PH: The theory and computation of evolutionary distances: Pattern recognition.
Journal of Algorithms 1980, 1(4):359373. Publisher Full Text

Smith TF, Waterman MS: Identification of common molecular subsequences.
J Mol Biol 1981, 147:195197. PubMed Abstract  Publisher Full Text

Paten B, Herrero J, Beal K, Birney E: Sequence progressive alignment, a framework for practical largescale probabilistic consistency alignment.
Bioinformatics 2009, 25(3):295301. PubMed Abstract  Publisher Full Text

Darling AE, Mau B, Perna NT: progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement.
PLoS One 2010, 5(6):e11147. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dubchak I, Poliakov A, Kislyuk A, Brudno M: Multiple wholegenome alignments without a reference organism.
Genome Res 2009, 19(4):682689. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Raphael B, Zhi D, Tang H, Pevzner P: A novel method for multiple alignment of sequences with repeated and shuffled elements.
Genome Res 2004, 14(11):23362346. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, Haussler D, Miller W: Aligning multiple genomic sequences with the threaded blockset aligner.
Genome Res 2004, 14(4):708715. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memoryefficient alignment of short DNA sequences to the human genome.
Genome Biol 2009, 10(3):R25. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Li H, Durbin R: Fast and accurate short read alignment with BurrowsWheeler transform.
Bioinformatics 2009, 25(14):17541760. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores.
Genome Res 2008, 18(11):18511858. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program.
Bioinformatics 2008, 24(5):713714. PubMed Abstract  Publisher Full Text

Rumble S, Brudno M: Shrimp – short read mapping package. [http://compbio.cs.toronto.edu/shrimp/] webcite
2008.

Jiang H, Wong WH: SeqMap: mapping massive amount of oligonucleotides to the genome.
Bioinformatics 2008, 24(20):23952396. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Weese D, Emde AK, Rausch T, Döring A, Reinert K: RazerS–fast read mapping with sensitivity control.
Genome Res 2009, 19(9):16461654. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq.
Nat Methods 2008, 5(7):621628. PubMed Abstract  Publisher Full Text

Pearson WR, Lipman DJ: Improved tools for biological sequence comparison.
Proc Natl Acad Sci U S A 1988, 85(8):24442448. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool.
J Mol Biol 1990, 215(3):403410. PubMed Abstract

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSIBLAST: a new generation of protein database search programs.
Nucleic Acids Res 1997, 25(17):33893402. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences.
J Comput Biol 2000, 7(12):203214. PubMed Abstract  Publisher Full Text

Kent WJ: BLATthe BLASTlike alignment tool.
Genome Res 2002, 12(4):656664. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W: Humanmouse alignments with BLASTZ.
Genome Res 2003, 13:103107. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lam TW, Sung WK, Tam SL, Wong CK, Yiu SM: Compressed indexing and local alignment of DNA.
Bioinformatics 2008, 24(6):791797. PubMed Abstract  Publisher Full Text

Karlin S, Altschul SF: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes.
Proc Natl Acad Sci U S A 1990, 87(6):22642268. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Döring A, Weese D, Rausch T, Reinert K: SeqAn an efficient, generic C++ library for sequence analysis.
BMC Bioinformatics 2008, 9:11. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

GogolDöring A, Reinert K: Biological Sequence Analysis Using the SeqAn C++ Library. [http:/ / www.crcpress.com/ ecommerce_product/ product_detail.jsf?isbn=97814200762 33] webcite
Chapman & Hall/CRC Mathematical & Computational Biology, CRC Press, Boca Raton, USA; 2009.

Zhang Z, Berman P, Miller W: Alignments without lowscoring regions.
J Comput Biol 1998, 5(2):197210. PubMed Abstract  Publisher Full Text

Zhang Z, Berman P, Wiehe T, Miller W: Postprocessing long pairwise alignments.
Bioinformatics 1999, 15(12):10121019. PubMed Abstract  Publisher Full Text

Rasmussen KR, Stoye J, Myers EW: Efficient qgram filters for finding all εmatches over a given length.
J Comput Biol 2006, 13(2):296308. PubMed Abstract  Publisher Full Text

Burkhardt S, Crauser A, Ferragina P, Lenhof HP, Rivals E, Vingron M: qgram based database searching using a suffix array (QUASAR).
J Comput Biol, RECOMB ’99 1999, 7783. PubMed Abstract  Publisher Full Text

Jokinen P, Ukkonen E: Two algorithms for approxmate string matching in static texts. [http://www.springerlink.com/content/p58155n8012x0477/] webcite
Mathematical Foundations of Computer Science 1991, Volume 520 of Lect Notes Comput Sc 1991, 240248.

Waterman MS, Eggert M: A new algorithm for best subsequence alignments with application to tRNArRNA comparisons.
J Mol Biol 1987, 197(4):723728. PubMed Abstract  Publisher Full Text

Marzal A, Vidal E: Computation of normalized edit distance and applications.
IEEE T Pattern Anal 1993, 15:926932. Publisher Full Text

Arslan AN, Eǧecioǧlu Ö: Efficient algorithms for normalized edit distance.

Tweedie S, Ashburner M, Falls K, Leyland P, McQuilton P, Marygold S, Millburn G, OsumiSutherland D, Schroeder A, Seal R, Zhang H, Consortium F: FlyBase: enhancing drosophila gene ontology annotations.
Nucleic Acids Res 2009, 37(Database issue):D555D559. PubMed Abstract  Publisher Full Text  PubMed Central Full Text