Abstract
Background
The alignment of short reads generated by nextgeneration sequencers to genomes is an important problem in many biomedical and bioinformatics applications. Although many proposed methods work very well on narrow ranges of read lengths, they tend to suffer in performance and alignment quality for reads outside of these ranges.
Results
We introduce RandAL, a novel method that aligns DNA sequences to reference genomes. Our approach utilizes two FM indices to facilitate efficient bidirectional searching, a pruning heuristic to speed up the computing of edit distances, and most importantly, a randomized strategy that enables effective estimation of key parameters. Extensive comparisons showed that RandAL outperformed popular aligners in most instances and was unique in its consistent and accurate performance over a wide range of read lengths and error rates. The software package is publicly available at https://github.com/namsyvo/RandAL webcite.
Conclusions
RandAL promises to align effectively and accurately short reads that come from a variety of technologies with different read lengths and rates of sequencing error.
Keywords:
nextgen sequencing; short read alignment; randomizationBackground
The alignment of reads to genomes is an important problem in many biomedical applications that relied on nextgeneration sequencing technologies. This problem is motivated by the fact that genomes for many species have been sequenced. And since one expects genomes within the same species differ little, such "referenced" genomes can facilitate the assembly of new genomes of other individuals within the same species from short reads. To address this problem, researchers have proposed many approaches together with software packages. Nevertheless, sequencing technologies have advanced rapidly, rendering many of these approaches ineffective or inefficient or both. One aspect that continually changes is the read length. Advanced technologies generally produce longer reads (with better accuracy). On the other hand, technologies that produce shorter reads can be less expensive and are therefore attractive in terms of cost. Thus, it is desirable to have algorithms and tools that perform well across different read lengths ranging from 35 to several hundreds basepairs.
Nevertheless, many existing algorithms struggle to perform consistently across a wide range of read lengths. Methods such as Bowtie [1] and BurrowsWheeler Alignment (BWA) [2] tend to perform better with shorter reads. Bowtie uses the BurrowsWheeler Transform (BWT) and FM index to build a permanent index of the reference genome. It then applies backtracking algorithm to find alignments. BWA also utilizes the BWT, but unlike Bowtie, can handle gaps and mismatches in the reads. More advanced versions of these methods include Bowtie2 [3] and BWASW [4] which are designed to work with longer reads. Bowtie2 can align reads with gaps and works better than Bowtie at longer reads. BWASW exploits the BWT and several heuristics to speed up the local alignment of reads.
Many techniques utilize data structures and techniques such as the BWT, FM index, suffix arrays, suffix trees/tries, hash tables or qgrams [511], aiming to speed up substring querrying. Additional heuristics are also used to enhance efficiency. Bowtie2 [3] and CUSHAW2 [12], for example, use seeds to quickly identify true candidates for alignment. GASSST [9] uses a filtering technique to reduce noisy seeds. Implementations of some of these approaches, e.g. Bowtie2, CUSHAW2, take advantages of parallelism or specialpurpose architectures. The use of heuristics can improve performance several folds, but might lead to overtuning parameters to a particular set of inputs, e.g. read lengths, species, or base error rates.
We introduce RandAL, an aligner based on a novel algorithm that performs consistently well over a wide range of read lengths, from 35 to several hundreds base pairs. We employ two FM indices for efficient bidirectional (exact) substring matching. To deal with inexact matching (i.e. allowing gaps), first, we find common substrings between reads and the reference genome. Then, these common substrings are extended to complete alignments based on a bounded threshold on the edit distance. We use a special pruning mechanism to shorten vastly the running time of computing edit distances in a vast majority of cases. The use of randomization in aligning reads to genomes increases the probability of finding seeds quickly and enables us determine methodologically important parameters to speed up the entire alignment process. Preliminary results show that our algorithm performed consistently well on a wide range of read lengths across several bacterial and eukaryotic genomes. The alignment quality of our method was better or generally as good as that of all compared methods.
Methods
Given a reference genome
Our strategy for read alignment is based on these ideas:
1 Detection of identical substring matches between r and
2 A special data structure called the FM index is used to facilitate memoryefficient,
timeoptimal exact string matching. This data structure facilitates efficient detection
of long common substrings between r and
3 Randomization is employed to find common substrings between r and
4 To account for insertion/deletion polymorphisms, we utilize the edit distance to provide an accurate measure for read alignment. Additionally, we employ a pruning heuristic to shorten the computation of edit distance, without com promising quality of alignment.
These ideas will be discussed in greater detail in the following sections.
Indexing the reference genome
Naive string matching takes quadratic time and therefore is too costly. Researchers have used data structures such as suffix tree, suffix array, and FM index to speed up string matching significantly. The FM index [13] in particular is desirable because it allows exact string matching to be done optimally in O(m) time, where m is the length of the query (i.e. the read), and is very space efficient. The FM index of the genome is a substring index that takes advantage of properties of the BurrowsWheeler transform to search incrementally all suffices of a read in the reference genome. This allows linear time (in read length) searching for exact substring matches. By design, the search direction is in reverse (backward) order with respect to the sequence.
To facilitate bidirectional string matching (to be discussed next), we employ two
FM indices. A conventional FM index that traces substring matches backward is denoted as
Finding common substrings between reads and genomes
Given a read r and a specific position p in r, Algorithm 1 outlines the steps in finding longest common substrings of r and the reference genome
Figure 1. Illustration of Algorithm 1: finding common substring. r_{i…p−1 }and r_{p…j }may match several substrings of the genome
The choice of W is important. If W is too small, M is large, and we will consider many common substrings between the read and the genome to construct alignments between the read and the genome. The more common substrings we consider, the more likely we can find the correct position of the read in the genome to align; but we also more likely make mistakes of aligning the read to an incorrect position. In other words, with smaller W, we might get more true positives (correct alignments) and more false positives (incorrect alignments) at the same time. On the other hand, if W is too large, we might not be able to find any common substrings and consequently unable to align the read to the genome. Therefore, inappropriate choices of W results in bad performance.
Algorithm 1 CommonSubstrings(read r, position p)
1: Let B be substrings of reference genome
2: Let F be substrings of reference genome
3: M := ∅
4: for each b ∈ B do
5: for each f ∈ F do
6: Let s := b ⊕ f be a concatenation of b and f.
7: if s is a contiguous block in
8: M := M ∪ s
9: return M
Our strategy for determining good values of W is based on randomization. As we shall see soon, the value p given to Algorithm 1 would be a random index of the read. To calculate W, first suppose that the correct substring of the reference genome
The random choice of p implies that the common substring found by Algorithm 1 would be a random block, which
is selected with probability
The CaucheSchwarz inequality tells us that
After simplifying, these imply that
Lemma: The expected length of the common substring between a read and the reference genome
found by Algorithm 1 is at least
Although we do not know what d, the distance between r and its aligned substring r', is, it can be estimated by the rates of single nucleotide polymorphism (SNP) of the given genome and given rate of sequencing error. Let b be the rate of each nucleotide being mutated or sequenced erroneously, which we may assume to be distributed by a binomial distribution with mean µ = mb and variance σ^{2 }= mb(1  b), where m is the read length.
Although we do not know exactly what d is, its upper bound t might be estimated by µ + cσ, for some constant c. With 100,000 reads, we found that c = 4 produces good performance with high true positives and low false positives.
In summary, the two critical parameters of our method t and W are methodologically derived as follows:
• The upper bound of the distance between a read and its aligned string,
• The lower bound of the expected length of common substrings,
W appears in Algorithm 1, and t appears in Algorithm 2, which is the next step after finding common substrings between reads and the reference genome.
Algorithm 2 AlignRead(read r)
1: p := 1
2: m := r
3: for i from 1 to A do
4: C := ∅
5: M := CommonSubstrings(r, p)
6: for each s ∈ M , which is a substring of
7: Let r_{i…j }be the substring of r that matches s exactly.
8: Let s_{L }be the (i − 1)substring of
9: Let sR be the (m − j)substring of
10: d := editdist(r_{1…i−1}, s_{L}) + editdist(r_{j+1…m}, s_{R})
11: if d ≤ t then
12: C := C ∪ (s_{L }⊕ s ⊕ s_{R})
13: if C has at least one sequences then
14: Return "fail to align", if C has more than 2 sequences.
15: Otherwise, align read r to each sequence of C. STOP.
16: p := random(1,r)
17: return "fail to align"
Extending common substrings to align reads to referenced genomes
Using long exact common substrings as seeds to align reads to genomes is similar to [3,12]. Our approach promises to be efficient because instead of exhaustively traversing indices of a read to find optimal common substrings, we find common substrings with respect to random index p of the read.
In Algorithm 2, we iterate at most A times to find long common substrings between
Figure 2. Illustration of Algorithm 2: extending common substring to alignment. Alignment of a read r to the reference genome
Note that in the first iteration, the position p is 1 and not a random index of r. The reason for this is that we would like the method of finding long common substrings (Algorithm 1) to be symmetrical in the sense that b and f could "wrap around" r. In other words, when p = 1, b is a suffix of r and f is a prefix of r. In this case, the concatenation of b ⊕ f is not a contiguous substring, but rather two contiguous strings separated by a big gap. This conceptualization of "wrapping around" the read, or thinking of it as a circular instead of linear string, turns out to be quite effective in practice. In many cases, p = 1 leads to very long common substrings that lead to correct alignments of reads.
If we cannot align r to any substring of
Thus, setting A = t + 1 ≥ d + 1, the longest common substring between a read and the genome is sampled expectedly after A iterations. Further, if A = c … (t + 1), then the probability of landing in the longest block is exponentially increased as a function of c. Trading for speed, c = 1 seems to work fine in practice, because even if Algorithm 1 does not return the longest common substring, it is often possible to extend it to find the correct alignment for the read. But longest common substrings minimizes the chance of running into repeats in the genome; i.e. common substrings upon which extensions will lead to incorrect alignments.
Fast heuristic for computing edit distances
Computing edit distances consumes much time of the alignment algorithm (Algorithm 2). In steps 1011 of Algorithm 2, we compute the edit distance between a read and a substring of the genome and discard it if the distance is greater than t. As each read often match with few substrings of the genome, we expect that such edit distances often exceed t. Examining lines 1011 of Algorithm 2, we see that actually we do not need to compute the exact value of d(x, y), the edit distance of x and y, as long as we can answer correctly the query d(x, y) ≤ t.
We claim that the edit distance of x and y, d(x, y) ≤ t if and only if Bound(x, y, t) ≤ t, where Bound is defined in Algorithm 3. To see this, observe that
• If d(x, y) ≤ t, then Bound(x, y, t) returns d(x, y).
• If d(x, y) > t, then Bound(x, y, t) returns either d(x, y) or t + 1. The only difference between Bound and the conventional edit distance lies in line 6 of Algorithm 3. Analyzing line 5, we see that once d_{i,j }> t for 1 ≤ j ≤ m (line 6), then d_{m,m }> t.
If d(x, y) > t, Bound(x, y, t) might not compute the edit distance correctly. Nevertheless, d(x, y) ≤ t if and only if Bound(x, y, t) ≤ t. For aligning reads to bacterial genomes, Bound is much faster than the worstcase complexity Θ(m^{2}).
Algorithm 3 Bound(x, y, t)
1: d_{i,0 }:= 0 for 0 ≤ i ≤ x
2: d_{0,j }:= 0 for 0 ≤ j ≤ y
3: for i := 0 to x do
4: for j := 1 to y do
5: d_{i,j }:= min(d_{i−1,j−1}+(xi == y_{j}), d_{i−1,j}+ 1, d_{i,j−1}+1)
6: return t + 1 if d_{i,j }> t for 1 ≤ j ≤ max{x, y}
7: return d_{x,y}
Results
RandAL is implemented in C++; FMindex codes are adapted from an external library (http://code.google.com/p/fmindexplusplus webcite). We compared our method against several aligners including Bowtie [1], BWA [2], Bowtie2 [3], BWASW [4], and CUSHAW2 [12]. We chose these methods based on the fact that they are recently published, very popular and their software are available. Comparison tests were conducted on a workstation with two Intel Xeon E52680 2.70GHz CPU and 64 GB RAM.
Each aligner is tested with 100,000 simulated reads generated for each of 6 bacterial genomes and 6 chromosomes of eukaryotic genomes. Sizes of these genomes range from 1.3 and 28 millions bases; see Table 1. Genomes were obtained from EMBLEBI (http://www.ebi.ac.uk/genomes webcite). Since recent sequencing technologies produce read lengths ranging from 35 to 400bp at greater speed and lower cost than previous technologies (e.g. Sanger sequencing) [14], we choose this range of read lengths to evaluate the methods. More specifically, the reads were generated at lengths 35, 51, 76, 100, 200, and 400 as these lengths have been mentioned in the literatures. The wgsim C program, part of the SAMtool package [15], was used to generate reads.
Table 1. Reference genomes, obtained from EMBLEBI (http://www.ebi.ac.uk/genomes webcite).
Extensive comparisons were performed using SAMtool's default settings, with base error rate at 2%; 15% of polymorphisms are indels with lengths drawn from a geometric distribution with density 0.7 ∗ 0.3l−1. Additionally, we present summary results for 1% and 4% base error rates with similar trends and conclusions.
Aligned reads from aligners are evaluated using the wgsim_eval Perl script, a part of the SAMtool package, using the default setting in which a read is mapped correctly if its mapping position is within a distance of 20 from the correct position. To compare alignment quality, we defined:
Alignment quality of 6 aligners
At the outset, we found that Bowtie and BWA were decent performers when read lengths were short, i.e. between 3156 bases. When read length increased, however, these two aligners were not competitive. Figure 3 shows the average performance (precision in the yaxis versus recall in the xaxis) of 6 aligners on 6 bacterial genomes and 6 eukaryotic genomes, respectively. Both BWA and Bowtie suffered from a decrease of precision as read length increases. For BWA, recall peaked at around 94% even at longer reads. Bowtie did better at recall for longer reads than BWA, but it was still not competitive to the other 4 aligners, including RandAL. Its bad performance at longer reads is unacceptable because technological trends tend to produce longer reads. For this reason, we will drop them out of headtohead comparisons at this point, and will only compare the 4 best aligners: Bowtie2, BWASW, CUSHAW2 and RandAL.
Figure 3. Alignment performance across 6 different read lengths. Recall (xaxis) versus Precision (yaxis) averaged across bacterial genomes and eukaryotic genomes, respectively, at read lengths of 35 bp, 51 bp, 76 bp, 100 bp, 200 bp and 400 bp.
A closer look at Figure 3 reveals that BWASW was relatively competitive but come roughly in the last place. There is no consistent winner (in terms of both precision and recall) among the top 3 performers, Bowtie2, CUSHAW2, and RandAL. Nevertheless, we can see that RandAL did noticeably better in terms of precision and was still competitive in terms of recall. Importantly, we see that across the wide range of read lengths from 35 to 400 for both bacterial and eukaryotic genomes, the performance of RandAL was consistently high in terms of both precision and recall; average precision was never below 0.98 and average recall was never below 0.95. This consistency distinguishes RandAL from the other top aligners.
An even closer look at individual bacterial and eukaryotic genomes (Figure 4) further reinforces the consistency in performance of RandAL. The lowest precision RandAL got in all 12 genomes was about 0.96, and the lowest recall was about 0.90. In comparison, for the other top performers, the lowest precision was about 0.93 and lowest recall was about 0.80.
Figure 4. Recall versus precision of top 4 aligners. Performance of top aligners on bacterial genomes (top 6 figures) and eukaryotic genomes (bottom 6 figures). Xaxis is recall; Yaxis is precision.
All top 4 aligners perform really well in both precision and recall as read length increases. Their performance was quite similar at 400 read length. At shorter read lengths, however, RandAL outperformed the rest, often in both precision and recall.
Rates of misalignment of top 4 aligners
Misalignment means aligning a read at an incorrect position. Misalignment increases the likelihood of running into problems later when we are interested in assembling reads into a complete genome and to identify where the constructed genome different from the reference genome (SNP calling).
The misalignment rate is calculated by dividing the number of incorrect aligned reads by the total number of reads. Figure 5 shows that averaging across all bacterial and eukaryotic genomes, RandAL got noticeably lower misalignments than the other aligners at all different read lengths. This result suggests that RandAL will likely work well with other tools and methods that require alignment of reads to reference genomes.
Figure 5. Rate of misalignment. Rate of misalignment averaged across bacterial and eukaryotic genomes.
Alignment quality at different base error rates
We have compared performance of 4 different methods using a base error rate of 2%; each nucleotide is mutated with the probability of 2%. Due to the lack of space, we cannot present a comprehensive comparison at different base error rates, as we have at 2%. Nevertheless, analyses at different base error rates show similar behaviors as we have observed at 2% error rates. We present a summary analysis at two other base error rates of 1% and 4% at read lengths of 35 bp, 100 bp, and 400 bp. Table 2 summarizes the average precision and recall of the top 4 aligners. These numbers suggest the followings:
Table 2. Average precision and recall at 1% and 4% base error rates.
1 All methods performed well at 1% base error rate.
2 With 4% base error rates, the other methods suffered, particularly with shorter reads. The best of them (Bowtie2) got ∼63% recall at 35 bp. Low recall rate means few reads (out of the total number) were aligned correctly.
3 Our method consistently achieved the highest performance (or among the highest performance) across different read lengths and base error rates. In precision, our method always got the highest, consistently above 97.8%. In recall, even at worst case of 4% base error rate and 35 bp read length, we got ∼94%.
Raw running times of top 4 aligners
Theoretically, asymptotic complexity of our method in aligning a read of length m is proportional to m + m^{2}. The worst case complexity of m^{2 }is due to edit distance computation. The heuristic for computing edit distance, however, reduces this worstcase complexity significantly in practice. Our testing showed that the running times of other methods, like ours, did not depend much on genome sizes.
Table 3 shows the averaged running times (in seconds) of the 4 aligners in aligning 100,000 reads. Our method suffered with shorter reads, but were the second fastest with longer reads (≥ 100 bp). It seems that the benefit of randomization becomes more evident with longer reads.
Table 3. Average running times of top 4 aligners at different read lengths.
Bowtie2 was the fastest across the board, but as shown in the previous section, its alignment quality is not as good as our method or CUSHAW2. Compared to ours, CUSHAW2 was significantly slower. Observing running times at different read lengths, we speculate that CUSHAW2 might be much be slower than ours with longer reads.
Difficulty of alignment in the presence of repeats
Although eukaryotic genomes are expected to be harder to align than bacterial genomes,
an examination of performance of the top 4 aligners in Figure 4 reveals that these aligners did not always perform better on bacterial genomes; eukaryotic
genomes were not always harder to align. To quantify the degree of difficulty in aligning
reads to genomes, we define repeat density as a measure of how many repeats a genome has. Since repeats directly affect alignment
quality, the notion of repeat density is meant as a quantifiable approximation of
genome complexity. More precisely, given a genome
Table 4. Repeat density of genomes, D(Sk), at various length k.
Table 5 shows that repeat density is correlated highly negatively to performance (precision
and recall) of all aligners. This means the larger
Table 5. Pearson correlation coefficients of repeat density and performance
Conclusions
We introduced RandAL, a novel randomized approach to aligning reads to reference genomes. We showed that it performed among some of the top aligners that currently exist. Unlike the other aligners, however, RandAL distinctly performs consistently well across a wide range of parameters (read lengths and error rates) across all tested bacterial and eukaryotic genomes. As current sequencing technologies can produce reads in the tested range at low cost [14], our approach promises to work well with these technologies.
Using repeat density as a measure of genome complexity, we showed that this measure correlated highly negatively with alignment quality (precision and recall). This implies that for larger and more complex genomes with many more repeats, these aligners will similarly suffer, as expected.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
VP conceived and designed the algorithm and experiments. NSV, QT, and NN implemented the algorithm. NSV and QT collected data, implemented experiments, carried out evaluation and data analysis. All authors read and approved the manuscript.
Acknowledgements
This research was partially supported by NSF grant CCF1320297 to VP.
Declarations
Publication charges for this work were funded by NSF grant CCF1320297 to VP.
This article has been published as part of BMC Genomics Volume 15 Supplement 5, 2014: Selected articles from the Third IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2013): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S5.
References

Langmead B, Trapnell C, Pop M, Salzberg SL, et al.: Ultrafast and memoryefficient alignment of short dna sequences to the human genome.
Genome Biol 2009, 10(3):25. BioMed Central Full Text

Li H, Durbin R: Fast and accurate short read alignment with burrowswheeler transform.
Bioinformatics 2009, 25(14):17541760. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Langmead B, Salzberg SL: Fast gappedread alignment with bowtie 2.
Nature Methods 2012, 9(4):357359. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li H, Durbin R: Fast and accurate longread alignment with burrowswheeler transform.
Bioinformatics 2010, 26(5):589595. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li H, Ruan J, Durbin R: Mapping short dna sequencing reads and calling variants using mapping quality scores.
Genome research 2008, 18(11):18511858. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li R, Li Y, Kristiansen K, Wang J: Soap: short oligonucleotide alignment program.
Bioinformatics 2008, 24(5):713714. PubMed Abstract  Publisher Full Text

Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, Brudno M: Shrimp: accurate mapping of short colorspace reads.
PLoS computational biology 2009, 5(5):1000386. Publisher Full Text

Homer N, Merriman B, Nelson SF: Bfast: an alignment tool for large scale genome resequencing.
PLoS One 2009, 4(11):7767. Publisher Full Text

Rizk G, Lavenier D: Gassst: global alignment short sequence search tool.
Bioinformatics 2010, 26(20):25342540. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ahmadi A, Behm A, Honnalli N, Li C, Weng L, Xie X: Hobbes: optimized grambased methods for efficient read alignment.
Nucleic Acids Research 2012, 40(6):4141. Publisher Full Text

Weese D, Emde AK, Rausch T, Doring A, Reinert K: Razersfast read mapping with sensitivity control.
Genome Research 2009, 19(9):16461654. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Liu Y, Schmidt B: Long read alignment based on maximal exact match seeds.
Bioinformatics 2012, 28(18):318324. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ferragina P, Manzini G: Indexing compressed text.
J ACM 2005, 52(4):552581. Publisher Full Text

Schatz M, Delcher A, Salzberg S: Assembly of large genomes using secondgeneration sequencing.
Genome Research 2010, 20(9):11651173. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup GPDP: The sequence alignment/map format and samtools.
Bioinformatics 2009, 25(16):20782079. PubMed Abstract  Publisher Full Text  PubMed Central Full Text