Abstract
Background
Nextgeneration sequencing (NGS) enables rapid production of billions of bases at a relatively low cost. Mapping reads from nextgeneration sequencers to a given reference genome is an important first step in many sequencing applications. Popular read mappers, such as Bowtie and BWA, are optimized to return top one or a few candidate locations of each read. However, identifying all mapping locations of each read, instead of just one or a few, is also important in some sequencing applications such as ChIPseq for discovering binding sites in repeat regions, and RNAseq for transcript abundance estimation.
Results
Here we present Hobbes2, a software package designed for fast and accurate alignment of NGS reads and specialized in identifying all mapping locations of each read. Hobbes2 efficiently identifies all mapping locations of reads using a novel technique that utilizes additional prefix qgrams to improve filtering. We extensively compare Hobbes2 with stateoftheart read mappers, and show that Hobbes2 can be an order of magnitude faster than other read mappers while consuming less memory space and achieving similar accuracy.
Conclusions
We propose Hobbes2 to improve the accuracy of read mapping, specialized in identifying all mapping locations of each read. Hobbes2 is implemented in C++, and the source code is freely available for download at http://hobbes.ics.uci.edu webcite.
Keywords:
Nextgeneration sequencing; Read alignment; All mapper; Additional prefix qgram; Hobbes2Background
DNA sequencing has become an indispensable tool for basic biomedical research, understanding disease mechanisms, and developing new and personalized treatments. Recent advances in nextgeneration sequencing (NGS) technologies, such as those from Illumina and Life Technologies, have enabled the rapid production of billions of bases at relatively low cost. However, the reads returned by NGS sequencers are usually short (in the range of 35 to 150 bps), and it is left to computational algorithms to extract information from these reads.
In many applications, mapping reads to a given reference genome sequence is an important first step in analysis of sequencing data. Popular read mapping programs, such as Bowtie [1], Bowtie2 [2], and BWA [3], aim at identifying one or a few top mapping locations for each read. This mapping strategy works well for many applications, and leads to a significant improvement in mapping speed compared to programs aiming at identifying all candidate locations. However, in many applications, it is often more desirable to identify all candidate locations of reads. (We will call programs that can identify all candidate locations all mappers). For instance, in ChIPseq experiments, many binding sites are located in the repeat regions of the genomes, and therefore, using read mappers returning only one or a few mapping locations might miss many binding peaks located within these repeat regions [4]. In RNAseq transcript abundance quantification, due to the presence of multiple transcript isoforms caused by alternative splicing, it is critical for a read mapper to return all possible mapping locations. Otherwise, the accuracy of the transcript abundance estimation can be significantly compromised [5,6].
As the sequencing technology is progressing toward producing longer reads, it is also very important to support insertion/deletion (indel) errors, which are caused by sequencing errors and/or genetic variations. Hobbes [7] is a software package proposed to identify all mapping locations of a read. It generates candidate locations efficiently using inverted lists of nonoverlapping qgrams with the help of bit vectors. In the presence of indel errors, however, bit vectors may filter out true locations, which negatively affects the accuracy of the results. Moreover, Hobbes may require a large amount of memory space for bit vectors. Recently developed all mappers, such as RazerS3 [8] and Masai [9], have focused on supporting indel errors and improved the performance in terms of accuracy and mapping time. RazerS3 can generate accurate mapping results by controlling mapping sensitivity based on its errorestimation technique. However, it requires a lot of time to produce highquality results. Masai reduces mapping time significantly by building an index on input reads and simultaneously generating candidate locations for multiple reads. However, Masai does not support multithreading since it builds an index on input reads and it is not straightforward to split input reads so that they can be processed by multiple threads.
In this paper, we present Hobbes2, a software package designed to return all mapping locations of long reads (e.g., 100bp or 150bp) containing indel errors as well as mismatch errors. Hobbes2 is built on top of Hobbes but significantly improves the performance in all aspects. Instead of using bit vectors, Hobbes2 makes use of another inverted list of an additional qgram to filter out false positives during the generation of candidate locations. The filtering based on the additional qgram captures all true locations while we can produce substantially fewer candidate locations. By eliminating bit vectors from memory, this approach also greatly saves memory consumption. Hobbes2 aligns reads one by one and naturally scales well in a multithreaded environment. Because read mappers map a tremendous number of reads, good multithread support is extremely important in read alignment. Through experimental comparisons, we show that Hobbes2 is an order of magnitude faster than the best all mappers, RazerS3 and Masai, while consuming less memory space and achieving a similar accuracy.
In the following sections, we briefly describe existing grambased approaches to solve the read mapping problem and introduce our approach with analysis. Then, we present how to handle indel errors with the proposed approach. We finally discuss the implementation issues to integrate the proposed technique with the existing Hobbes package.
Methods
After we summarize qgrambased approaches for mapping reads to a given reference genome sequence, we propose a filtering technique using an additional prefix qgram. We first restrict our discussion to the readmapping problem with mismatch errors only. Then we explain how to extend the proposed technique to support indel errors in a separate section.
Generating candidate locations using qgrams
A qgram of a genome sequence s is a subsequence of s of length q. The set of locationally overlapping qgrams of s, which is denoted by G(s), is obtained by sliding a window of length q over the bases of s. For example, the overlapping 3gram set of a sequence s = ACCTACCT is G(s)={ACC,CCT,CTA,TAC,ACC,CCT}. Note that we use an ordered multiset for qgrams, where the same qgrams are distinguished by their locations in a sequence. Because a base of a sequence is included in at most q overlapping qgrams of the sequence, a substitution of one base modifies at most q overlapping qgrams of a sequence. Therefore, if the maximum allowed mismatch errors between two sequences of r and s are k bases, they should share at least the following number of common overlapping qgrams (this technique is known as count filtering [10]).
As we look for a genome subsequence s whose length is the same as a read r, we can simplify Equation 1 as follows (because G(r)=G(s)).
Many techniques generate candidate locations using invertedlist structures of overlapping qgrams of a reference genome sequence. An inverted list of a qgram g, denoted by I(g), is a list of locations within a genome sequence where the qgram occurs. For instance, the inverted list of 3gram CCT in our previous example is I(CCT)={1,4} because CCT occurs at locations 1 and 4 in the sequence ACCTACCT. To map qgrams into their corresponding inverted lists, an inverted index is built on overlapping qgrams of a genome sequence. Given a read r and a Hamming distance threshold k (or a maximum number of allowed mismatches k), we can generate candidate locations using an inverted index of a genome sequence as follows. First, we decompose r into overlapping qgrams. For each qgram g in G(r) and its relative location l in r, we retrieve an inverted list I(g) by looking up the inverted index with the search key g. Because I(g) contains starting locations of the qgram g in the genome sequence, we need to modify I(g) by subtracting l from each element in I(g) to find the locations of genome subsequences containing g. We call a modified inverted list a normalized inverted list and denote it by I^{n}(g). We finally select those locations as candidates that appear in at least T normalized inverted lists.
Figure 1 shows an example of a reference genome sequence and its 5gram inverted index, which is taken from ([7]). To map a read r = ACGGTCTTCCCTACGGT with Hamming distance threshold k=2 and T=17−5+1−2·5=3, we first look up the read’s 5grams in the inverted index. Notice that only the grams ACGGT and CGGTC (underlined in the read) are present in the index. We traverse their inverted lists, and normalize each element relative to the location of the corresponding gram in the read. For example, the 5gram CGGTC appears at location 1 in the read, so the relative location of the element on CGGTC’s inverted list is 106−1=105.
Figure 1. Excerpt of a reference sequence and a portion of its 5gram inverted index. The inverted lists of the 5grams ACGGT, CGGTC, and ACCCT are shown, each containing a sorted list of locations in the reference sequence where the respective 5gram appears.
In this way, we can count how many times the read’s grams are contained in the subsequence of the reference sequence starting at a fixed location (location 105, in this example). The gram ACGGT appears twice in the read, and we treat each occurrence as a separate list. Its appearance at location 0 yields a normalized list of {105−0=105,118−0=118}, and a normalized list {105−13=92,118−13=105} for location 13. Next, we count the number of occurrences of each element on the normalized lists. The locations 92 and 118 are pruned according to the count filtering, because their number of occurrences do not meet the lower bound of T=3. Location 105 has a count of 3, and therefore it is a candidate answer whose Hamming distance to the read still needs to be computed.
In most techniques, the naïve count filtering method is not directly used to generate candidates because it requires scanning all inverted lists of overlapping qgrams in a read. To map a 100bp read using an 11gram inverted index, for example, we need to scan 100−11+1=90 inverted lists. Moreover, some inverted lists are usually very long and this method would incur prohibitive scanning costs. Instead, a simple variation of the count filtering known as the prefix filtering [11] is widely used for generating candidates. Because a candidate genome subsequence s needs to contain Tqgrams of a read r according to the count filtering, s must contain at least one qgram among G(r)−(T−1)=k·q+1qgrams in G(r). Given an inverted index of a genome sequence, we retrieve and normalize inverted lists of k·q+1qgrams in G(r) and then generate candidates by taking the union of locations in the normalized inverted lists. To minimize the number of candidates, we sort qgrams in G(r) by their frequencies in the reference genome and take k·q+1 low frequency qgrams (which are called prefix qgrams).
Recent techniques [7,12,13] have focused on deriving a tight lower bound of the number of prefix qgrams. The basic idea behind these techniques is to use nonoverlapping qgrams [14] of a read. If we use nonoverlapping qgrams, a substitution of one base of a read affects only one qgram. Hence, if a genome subsequence s is different from a read r within k mismatches, the overlapping qgram set of s will contain at least one nonoverlapping qgram among k+1 nonoverlapping qgrams of r. Based on the observation, we can generate candidates as follows. We first select k+1 nonoverlapping qgrams in G(r). From the inverted index of overlapping qgrams of a reference genome sequence, we then retrieve inverted lists of the selected qgrams and normalize them. We finally produce candidates by taking the union of locations in the normalized inverted lists. Existing techniques use the sum of frequencies of qgrams to estimate the union size of inverted lists. Hobbes proposed a dynamic programming algorithm to select prefix qgrams based on the following recurrence so as to minimize the sum of frequencies of selected qgrams [7].
where i≤k+1,1≤j≤G(r)−k·q, L[ n].len is the length of the inverted list of the n^{th}qgram in G(r), and M(i,j) is a lower bound on the sum of the lengths of the inverted lists of i nonoverlapping grams starting from a location no greater than j+(i−1)·q. In Equation 3, M(0,j) is initialized to zero and M(i,0) is initialized to infinity and the goal is to compute M(k+1,G(r)−k·q).
Exploiting an additional prefix qgram of a read
Despite of the effort of the recent work, the number of candidates generated by using optimal prefix selection is still too large to refine and/or verify directly. Thus, it is important to further filter out false positives while generating candidates. Hobbes attaches a bit vector to each element in an inverted list and makes use of bit vectors to remove false positives while generating candidates. However, bit vectors greatly increase the size of an inverted index and thus consume a lot of memory space. In this paper, we propose a powerful and memory efficient filtering method. The proposed method does not require additional memory space while it can still filter out more false positives than bit vectors. Our technique is based on the following lemma.
Lemma1 (Additional prefix)
Given an inverted index of a genome sequence and a read with a Hamming distance threshold k, suppose we select k+2 nonoverlapping qgrams from the read. If we retrieve inverted lists of the selected k+2qgrams from the index, we can select those locations as candidates that come from at least two normalized inverted lists.
The intuition of the lemma is that the set of overlapping qgrams in a candidate genome subsequence s must contain at least 2 qgrams among k+2 nonoverlapping qgrams in a read r, because otherwise the difference between r and s would be larger than k bases. We analyze the lemma more precisely using an example as follows. Assume a Hamming distance threshold k is 1 and we select k+2 nonoverlapping qgrams S={g_{1},g_{2},g_{3}} from a read. If we enumerate all possible subsets of S whose cardinality is k+1=2, we obtain three subsets, S_{1}={g_{1},g_{2}}, S_{2}={g_{1},g_{3}}, and S_{3}={g_{2},g_{3}}. As described in the previous section, we can generate candidates using any of three subsets. Let C(S_{i}) be , the set of candidates generated using a subset S_{i} of S. If a candidate location is a true mapping, it should be contained in all of C(S_{1}), C(S_{2}), and C(S_{3}). Therefore, we can generate refined candidates by taking the intersection of C(S_{1}), C(S_{2}), and C(S_{3}). According to this observation, we formulate the set of candidates as
Using the distributive, associative, absorption, and idempotent properties of sets, we can rewrite the formula to
which is illustrated in a diagram in Figure 2(a). If we compare it with the candidate set generated by k+1 prefix qgrams g_{1} and g_{2}, which is depicted in Figure 2(b), we can see that an additional prefix qgram g_{3} plays a significant role of filtering out false positives.
Figure 2. Filtering effect of an additional prefixqgram. Grayscaled areas indicate candidates. (a) An additional prefix qgram g_{3} plays an important role of filtering out a number of false positives in E and F. (b) If we use k+1=2qgrams, g_{1} and g_{2}, much more candidates are generated.
Given inverted lists of k+2 prefix qgrams, in general, we can generate candidates by taking the union of pairwise intersections of the inverted lists. That is, each inverted list is intersected k+1 times. However, we do not need to scan each inverted list k+1 times to generate candidates. Instead, we can use an algorithm that merges all inverted lists by scanning each of them once and selects those locations that appears at least 2 times according to Lemma 1 [see Additional file 1 for the candidate generation algorithm].
Additional file 1. Supplementary material. This file contains supplementary text, algorithm, figures, and tables.
Format: PDF Size: 171KB Download file
This file can be viewed with: Adobe Acrobat Reader
Now, we discuss how to select k+2 nonoverlapping prefix qgrams. We first consider finding optimal k+2 nonoverlapping qgrams, with which we produce the minimum number of candidates. Given a read r with k errors, let n=G(r)=r−q+1 and m=k+2. The number of possible combinations of k+2 nonoverlapping qgrams can be calculated by the following recurrence.
where m≥1, f(n,1)=n, and f(n,m)=0 if n≤0. By solving the recurrence, we obtain
For a 100bp read with 5 errors, there are combinations of k+2=7 nonoverlapping 11grams. We may find an optimal k+2 nonoverlapping qgrams by estimating the size of unions of intersections of inverted lists for each combination. However, this naïve evaluation incurs prohibitive computing time and is not practical.
In this paper, we develop a heuristic approach to select good k+2 nonoverlapping prefix qgrams. We make the following two assumptions. The first assumption is that the union size of inverted lists is proportional to the sum of their sizes. It is worthwhile to note that all techniques based on the prefix filtering make use of this assumption in minimizing the number of candidates. The second is that candidate locations contained in a normalized inverted list are independently distributed and uniformly random. We define the probability of the occurrence of a qgram g in a subsequence be P(g)=I^{n}(g)/N, where N is the total number of subsequences in a reference genome. For any two qgrams of g_{x} and g_{y}, we can estimate I^{n}(g_{x})∩I^{n}(g_{y})≈P(g_{x})·P(g_{y})·N by the second assumption. Given k+2 prefix qgrams of g_{1},…,g_{k+2}, we calculate the union size of pairwise intersections of the inverted lists of the qgrams as
where 1≤x≤k+2, 1≤y≤k+2, and x≠y. Note that we sum intersection sizes to calculate the union size according to the first assumption. Hence, we can minimize by minimizing
For simplicity, we use an upper bound by dropping in Equation 8 and find prefix qgrams that minimize . By selecting k+2 prefix qgrams such that the sum of frequencies of the selected qgrams is minimized, we can minimize since P(g_{i})=I^{n}(g_{i})/N. Therefore, we reuse the recurrence in Equation 3 to select k+2 prefix qgrams and compute M(k+2,G(r)−k·q) using the dynamic programming algorithm proposed in Hobbes. Note that our approach guarantees that the scanning costs of selected k+2 inverted lists are minimized.
Our heuristic solution could accidentally select a poor combination of k+2 prefix qgrams. However, we can expect that the proposed technique will generate fewer candidates, since it takes intersections of inverted lists as illustrated in Figure 2(a) while previous techniques merely take the union of inverted lists for generating candidates. In all experiments we ran, we observed that our solution was always better than the k+1 prefix scheme [see Additional file 1: Figure S1 for experimental comparison between k+1 and k+2 prefix schemes].
Lemma 1 can be generalized by using k+c nonoverlapping qgrams. That is, we can select those locations as candidates that come from at least c normalized inverted lists of k+c nonoverlapping qgrams of a read. However, we focus only on the specific case of c=2 (as described in Lemma 1) for the following reasons. We observed that only the first additional prefix qgram brings a substantial improvement and the effects of additional qgrams other than the first one are not significant [see Additional file 1: Figure S2 for the experimental results of different c values]. As we increase c, we may lose chances to select low frequency qgrams and as a result, the cost of scanning inverted lists would eventually outweigh the savings from reducing the number of candidate locations. Moreover, supporting indels will be more complicated. Thus, we believe that it is not worthwhile to try to find an optimal c value in practice.
Supporting insertions and deletions
If we use an edit distance threshold (i.e., we allow not only substitutions but also insertions and deletions of bases) for mapping a read, indels introduce two potential problems to the above described technique. In this section, we discuss these two potential problems and describe how to fix them. The first potential problem is caused by insertions or deletions occurred between two matched qgrams. In the proposed technique, a candidate genome subsequence s needs to contain at least two qgrams in a read r, where each of which must appear at the same location in both r and s. In case of edit distance constraints, however, the proposed technique could filter out a valid candidate s, since indels between two matched qgrams make the locations of the qgrams in r different from those in s.
For example, consider a genome sequence S_{g}=CCAGTAATGCTGTTG… and a read r=AGTAATCTGTTG. Given an edit threshold k=1, assume that we select k+2=3 nonoverlapping trigrams of g_{1}=AGT, g_{2}=ATC, and g_{3}=TTG in G(r) (underlined in the read) as the prefix qgrams. For g_{1}, we obtain location 2 of S_{g} since g_{1} appears at location 0 in r and at location 2 in S_{g}. For g_{2}, we cannot find a matched qgram in S_{g}. Finally, for g_{3}, we get location 3 of S_{g} since g_{3} appears at location 10 in r and at location 13 in S_{g}. Because there is no location that appears at least twice, we filter out both locations of 2 and 3. However, the edit distance between the read and the subsequence of S_{g} starting at location 2 is 1 and we should be able to return location 2 of S_{g} as a mapping location. The problem is caused by the underlined base G in S_{g}, which is located between two matched grams AGT and TTG as depicted in Figure 3(a).
Figure 3. Problems caused by indels.(a) Indels occurring between two matched qgrams (b) Deletions occurring before any matched qgrams. (c) Insertions occurring before any matched qgrams. (d) Verification windows of a semiglobal alignment algorithm.
To fix this problem, we need to allow gaps between two matched qgrams up to the edit distance threshold. That is, we treat two locations appearing only once as candidate locations if their difference is within the edit distance threshold. In our example, since the difference between the locations 2 and 3 is within the edit distance threshold 1, we generate both of the locations 2 and 3 as candidates.
The second problem is caused by indels occurring before any locations of matched prefix qgrams. If there are d deletions of bases in a reference sequence before the matched qgrams, we need to consider a subsequence starting at l−d, where l is a candidate location calculated from the matched qgrams. For example, consider a genome sequence S_{g}=GAGAGATCTGCATAA… and a read r=GAAGATCTGCATAA, where three underlined trigrams GAT, TGC, and TAA in G(r) are selected as the prefix grams for an edit distance 1. As our technique returns location 1 in S_{g} for all the three qgrams, we use the location as a candidate and verify the genome subsequence s starting at the location. Because the edit distance between r and s is 2, we do not have a mapping of the read r. However, if we consider the subsequence starting from location 0 of S_{g}, we should be able to return the location 0 as a mapping location because the edit distance is 1 as depicted in Figure 3(b). This problem is caused by the deletion of the underlined base G from S_{g}, which is located before the three matched grams GAT, TGC, and TTA.
By contrast, if there are i insertions of bases in a reference sequence before any locations of matched prefix qgrams, we need to consider a subsequence starting at location l+i, where l is a candidate location. For example, consider a genome sequence S_{g}=AGAAGATCTGCATAA… and a read r=GAGAGATCTGCATAA, where three underlined trigrams are matched for an edit distance 1. Although location 0, which is calculated from the matched trigrams, does not satisfy the edit distance threshold, we should be able to return location 1 as a mapping location as depicted in Figure 3(c). The underlined base G in r (or the insertion of the G into S_{g}) causes this problem.
Therefore, given an edit distance threshold k and a candidate starting location l, a potential match can start at any location between l−k and l+k. Similarly, indels can also occur after matched prefix qgrams. Given a candidate ending location c, a potential match can end at any location between c−k and c+k. So altogether, we need to consider a verification window from l−k to c+k to find all potential matches (Figure 3(d)). However, because the verification time based on sequence alignment is proportional to the size of the verification window, enlarging the window at both ends by k is computationally expensive. In Hobbes2, we adopt the following heuristic to improve the mapping speed: We first use the verification window [ l,c+k] (Figure 3(d)), and run a semiglobal banded alignment algorithm to identify all potential matches located within this window. If this verification window yields no matches, we then consider the verification window [ l−k,c]. This approach could potentially miss some true mappings that start before l and at the same time end after c. However, empirically we found that those mappings are relatively rare and do not significantly impact the accuracy of our algorithm.
Implementation details
As Hobbes2 uses an additional prefix qgram instead of bit vectors, it can significantly improve the performance and substantially reduce memory consumption. In this section, we describe how Hobbes2 was implemented on top of Hobbes. Other details and optimization techniques for implementation that are not presented here are the same as those in Hobbes.
Hobbes2 builds an inverted index in the same way that Hobbes does. That is, each element in an inverted list contains a bit vector. However, Hobbes2 loads inverted lists without bit vectors into the memory if it determines that a read has at least k+2 nonoverlapping qgrams, where k is a distance threshold. In this case, Hobbes2 filters out false positives using an additional qgram while it generates candidates. If the number of qgrams contained in a read is less than k+2, Hobbes2 loads both inverted lists and bit vectors into memory. Hobbes2 assumes fixed length input reads and calculate the number of nonoverlapping qgrams using the length of the first input read. For variable length reads, it also safely maps each read but it does not filter out false positives for those reads that have not enough qgrams.
Obviously, the predefined gram length is important for the usability of the proposed filtering technique since it determines the number of qgrams in a read. If we increase the gram length, there could be fewer locations in a genome sequence containing the gram, causing the inverted lists to be shorter. Thus, it may decrease the time for scanning inverted lists and produce fewer candidate locations. On the other hand, the size of a hash table for grams in an inverted index becomes very large. If we increase the gram length by 1, the size of the hash table increases by up to 4 times since we have four distinct bases of A, C, G, and T.
Thus, index lookup time for a gram may be the bottleneck of read mapping as the size of the hash table becomes larger. We found that the mapping speed with an 11gram inverted index was the best when we mapped 100bp reads on HG18 genome sequence [see Additional file 1: Figure S1 for experimental results on gram length]. Based on the experiments, Hobbes2 uses 11grams and thus it can always find enough grams for a 100bp read with up to 7 errors.
Results and discussion
Experimental setup
We implemented Hobbes2 in C++, and compiled it with GCC 4.4.3. All experiments were run on a machine with 94 GB of RAM, and dual Intel Xeons X5670 (12 cores and 24 threads total) at 2.93 GHz, running a 64bit Ubuntu OS. We performed experiments to examine all mapping capabilities of Hobbes2. We focused on edit distance constraints, and all experiments were performed with the edit distance threshold set to be 5 [see Additional file 1: Table S1 for the experimental results with Hamming distance constraints]. Hobbes2 also has an optional mmapping mode, which returns the results of only those reads whose maximum number of distinct mapping locations is less than or equal to a given threshold m. We reported the experimental results on the mmapping mode in Additional file 1: Table S2.
We thoroughly compared Hobbes2 with three stateoftheart all mappers  Hobbes, RazerS3, and Masai, and three other popular read mappers  GEM [15], BWA and Bowtie2. We did not include other all mappers (such as SOAP2 [16], SHRiMP2 [17], mrsFAST [18], and mrFASTCO [19]) in our comparison as it has been shown previously that these all mappers do not perform as well as RazerS3, Masai, and/or Hobbes. We configured read mappers to output results in the SAM format with cigar strings [see Section S3 in Additional file 1 for the details of the configuration of each read mapper].
Index construction and memory footprint
For each reference genome, we built an inverted index of overlapping qgrams on the reference genome. By default, Hobbes2 uses 16bit vectors, resulting in a total index size of 16 GB for the whole human genome NCBI HG18. Hobbes2 loads only the index into memory and the memory footprint of the index for HG18 is about 11 GB. Because Hobbes2 has a tightknit multithreaded framework that parallelizes both indexing and mapping, it took only a few minutes to build an index for HG18.
Single end alignment on simulated data
We generated 100k simulated reads of length 100bp from HG18 using a read simulator, Mason [20]. We used the default profile setting of Mason with the illumina option. We used Rabema [21] benchmark to compare accuracies of read mappers. The benchmark was performed for an error rate 5%, or edit distance 5. To build a gold standard of simulated reads, we used RazerS3 in fullsensitive mode (we ran RazerS3 with its default setting for the performance comparison).
The benchmark found all, all of the best, and any of the best edit distance locations from the mapping results of each mapper. As the simulator generated original locations of simulated reads, we also measured the recall of each mapper, which is the fraction of reads whose original locations correctly reported.
Table 1 shows rabema scores in percentage, each of which is the average fraction of edit distance locations returned by a read mapper per read. Large numbers are total scores and small numbers are scores for reads with errors. We could not measure the mapping time of Masai with 16 threads since it does not support multithreading. We omitted the mapping time of Bowtie2 with a single thread since it could finish only about 10% of 100k reads in 6 hours.
Table 1. Rabema benchmark results of mapping simulated 100k reads of length 100bp against HG18
In terms of the accuracy of all mapping, the top three performers were RaserS3, Hobbes2 and Masai, with an accuracy score of 99.90, 99.85 and 99.83, respectively. Hobbes2 was slightly worse than RaserS3 on reads with high error rates. However, in terms of mapping time, Hobbes2 was much faster than both RaserS3 and Masai  six times faster than RaserS3 and twice as faster than Masai on a single thread, and 20 times faster than RaserS3 on 16 threads (there is no implementation of multithreading on Masai.) BWA and Bowtie2, two most popular read mappers, trailed behind Hobbes2 in both accuracy and mapping time. GEM was faster than Hobbes2 on a single thread but slower than Hobbes2 on 16 threads. Although GEM mapped reads fast but it lost a lot of mapping locations of edit distance 4 and 5 and exhibited poor accuracy.
We also ran BWA, Bowtie2, and GEM in their default mode (or best mapping mode) to compare the performance. The results are reported at the end of Table 1 with mapper names, Bowtie2*, BWA*, and GEM*, respectively. Although they could produce results quickly, they exhibited poor mapping results. In particular, they lost most of locations of edit distance 5 and about a half of locations of edit distance 4.
Single end alignment on real data
We used the human genome with HG18, caenorhabditis elegans (WormBase WS201), and drosophila melanogaster (FlyBase release 5.42) as reference sequences. For the human genome, we used the 100bp reads from specimen HG00096 of the 1000 genome project [22]. We also used 100bp reads taken from the DNA Data Bank of Japan (DDBJ) repository [23] with entry SRX026594 for the worm genome and SRX148416 for the fly genome.
Table 2 lists the experimental results of mapping 1 million reads of length 100bp against the human genome. We excluded Bowtie2 and BWA in the experiment since they are not designed as all mappers and exhibited poor mapping speed when aligning long, repetitive genomes. Hobbes2 mapped more reads than other read mappers while running significantly faster. Hobbes2 with 16 threads was about 3 times faster than Hobbes, 9 times faster than Masai and 42 times faster than RazerS3. By comparing the results of 500k reads and 1 million reads, we observed that the mapping time of each read mapper was approximately proportional to the number of input reads.
Table 2. Results of mapping 500k and 1 million single end reads of length 100bp against HG18
The memory footprint of Hobbes2 was the smallest among the four mappers being compared on the dataset with 1 million reads. The memory requirement of Hobbes2 was independent of the number of input reads because it did alignment read by read. Masai used slightly more memory as the number of reads increased. The memory consumption of RazerS3 was greatly affected by the number of reads and the total number of mappings.
Table 3 shows the results of mapping 1 million reads of length 100bp against the C. elegans and D. melanogaster data sets. Hobbes had much higher memory footprint than Hobbes2 on the D. melanogaster data set. RazerS3 was unable to map reads on the D. melanogaster data set because it used too much memory space and was killed before the job was finished.
Table 3. Results of mapping 1 million single end reads of length 100bp against C. elegans and D. melanogaster
Again, Hobbes2 mapped more reads than other mappers for both data sets. For the C. elegans data set, Hobbes2 with 16 threads was about twice as faster as Hobbes, 7 times faster than Masai and 18 times faster than RazerS3. For the D. melanogaster data set, Hobbes2 also exhibited the best mapping speed in both single threaded and multithreaded cases. Hobbes2 used the least amount of memory in both data sets.
The significant improvement of Hobbes2 is mainly due to our improved method of generating initial candidates. It is very important to quickly generate a small number of candidates in edit distance case since most of the mapping time is spent in the verification process, which requires a more expensive dynamic programming procedure. Table 4 shows the number of candidates initially generated by the read mappers and the time for generating candidates. Hobbes2 generated the least number of candidates among the read mappers. Hobbes generated candidates very fast with the help of bit vectors, but Hobbes2 and Masai generated about four times fewer candidates than Hobbes. Hobbes2 produced candidates more than two times faster than Masai while generating fewer candidates.
Table 4. Filtration of 500k reads of length 100bp on HG18
Paired end alignment
For paired end read alignment, we used the human genome HG18 as the reference sequence. We ran experiments using 100bp read pairs from specimen HG00096.
Our performance results for the paired end alignment are summarized in Table 5. We excluded BWA in the experiment since it does not support the minimum insert size. Bowtie2 in all mapping mode could not finish the mapping in 24 hours. We used Bowtie2 in the default mode, which is listed as Bowtie2* in Table 5. Since Masai does not directly support mapping paired end reads, we separately ran masai_mapper for each read file to output results in Masai’s raw format, and merged the results using masai_output_pe to produce mappings in the SAM format.
Table 5. Results of mapping 1 million × 2 paired end reads of length 100bp against HG18
We observed that Hobbes2 was the fastest among all mappers in both single threaded and multithreaded cases. With 16 threads, Hobbes2 was about twice as faster as Hobbes, and 31 times faster than RazerS3. In terms of mapped pairs, Hobbes2 was similar to Hobbes and RazerS3, but was better than Masai. Hobbes used the least amount of memory for the paired end mapping. Although Bowtie2* ran very fast, it lost many mapping pairs, and thus exhibited poor mapping quality compared with other all mappers.
Conclusion
Hobbes2 efficiently finds all mapping locations of a read in a reference genome. We have shown that Hobbes2 is substantially faster than stateoftheart all mappers while maintaining similar accuracy. In addition, Hobbes2 consumes less memory space than other read mappers for long reads since it does not rely on additional data structures other than inverted lists of qgram signatures.
Our experiments have also shown that Hobbes2 scales very well in multithreaded environment, and exhibits the best performance among the competitors. Given today’s trend toward massively multicore CPUs, read mappers with good multithread support will likely become more necessary in the future.
Because of its simplicity, we believe the candidate generation method implemented in Hobbes2 can also be adapted for other read mapping programs for improving their performance.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JK developed the filtering technique using additional prefix grams and CL applied the technique to the read alignment problem. JK implemented the read mapping algorithm using the technique and ran the experiments. XX designed experiments and analyzed the results. JK wrote the manuscript, and CL and XX read and edited the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We thank Lingjie Weng for helpful discussions on implementation of Hobbes and Hobbes2. We also thank Jacob Biesinger for testing Hobbes2 and providing valuable feedback.
This work was partially supported by Chonbuk National University [Research Funds in 2012 to JK]; the National Institutes of Health R01HG006870.
References

Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memoryefficient alignment of short dna sequences to the human genome.
Genome Biol 2009, 10:25. BioMed Central Full Text

Langmead B, Salzberg SL: Fast gappedread alignment with bowtie 2.
Nat Methods 2012, 9:357359. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li H, Durbin R: Fast and accurate short read alignment with burrowswheeler transform.
Bioinformatics 2009, 25:17541760. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Newkirk D, Biesinger J, Chon A, Yokomori K, Xie X: Arem: aligning short reads from chipsequencing by expectation maximization.
J Comput Biol 2011, 18:14951505. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Roberts A, Pachter L: Streaming fragment assignment for realtime analysis of sequencing experiments.
Nat Methods 2013, 10:7173. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lil Y, Xie X: A mixture model for expression deconvolution from rnaseq in heterogeneous tissues.

Ahmadi A, Behm A, Honnalli N, Li C, Xie X: Hobbes: optimized grambased methods for efficient read alignment.
Nucleic Acids Res 2012, 40:41. Publisher Full Text

Weese D, Holtgrewe M, Reinert K: Razers3: faster, fully sensitive read mapping.
Bioinformatics 2012, 28:25922599. PubMed Abstract  Publisher Full Text

Siragusa E, Weese D, Reinert K: Fast and accurate read mapping with approximate seeds and multiple backtracking.
Nucleic Acids Res 2013, 41:78. Publisher Full Text

Ukkonen E: Approximae string matching with qgrams and maximal matching.

Chaudhuri S, Ganti V, Kaushik R: A primitive operator for similarity joins in data cleaning. In Proceedings of the 22nd International Conference on Data Engineering: 37 April 2006. Edited by Liu L, Reuter A, Whang KY, Zhang J. Atlanta: IEEE; 2006:515.

Xiao C, Wang W, Lin X: Edjoin: an efficient algorithm for similarity joins with edit distance constraints. In Proceedings of the 34th International Conference on Very Large Databases: 2328 August 2008. Edited by Buneman P, Kersten M, Ozsoyuglu Z. Aukland: VLDB Endowment; 2008:933944.

Qin J, Wang W, Lu Y, Xiao C, Lin X: Efficient exact edit similarity query processing with the asymmetric signature scheme. In Proceedings of ACM SIGMOD International Conference on Management of Data: 1216 June 2011. Edited by Kementsietsidis A, Velegrakis Y. Athens: ACM; 2011:10331044.

Ning Z, Cox AJ, Mullikin JC: Ssaha: a fast search method for large dna databases.
Genome Res 2001, 11:17251729. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

MarcoSola S, Sammeth M, Guigó R, Ribeca P: The gem mapper: fast, accurate and versatile alignment by filtration.
Nat Methods 2012, 9:11851188. PubMed Abstract  Publisher Full Text

Li R, Yu C, Li Y, Lam TW, SM Yiu KK, Wang J: Soap2: an improved ultrafast tool for short read alignment.
Bioinformatics 2009, 25:19661967. PubMed Abstract  Publisher Full Text

David M, Dzamba M, Lister D, Ilie L, Brudno M: Shrimp2: sensitive yet practical short read mapping.
Bioinformatics 2011, 27:10111012. PubMed Abstract  Publisher Full Text

Hach F, Hormozdiari F, Alkan C, Hormozdiari F, Birol I, Eichler EE, Sahinalp SC: mrsFAST: a cacheoblivious algorithm for shortread mapping.
Nat Methods 2010, 7:576577. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Alkan C, Kidd JM, MarquesBonet T, Aksay G, Antonacci F, Hormozdiari F, Kitzman JO, Baker C, Malig M, Mutlu O, et al.: Personalized copynumber and segmental duplication maps using nextgeneration sequencing.
Nat Genet 2009, 41:10611067. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Holtgrewe M: Mason  a Read Simulator for Second Generation Sequencing Data. Technical report,. Berlin: Freie Universität; 2010.

Holtgrewe M, Emde AK, Weese D, Reinert K: A novel and welldefined benchmarking method for second generation read mapping.
BMC Bioinformatics 2011, 12:210. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

1000 Genomes: a deep catalog of human genetic variation [http://www.1000genomes.org/data webcite]

DNA data bank of Japan [ftp://ftp.ddbj.nig.ac.jp webcite]