Abstract
Background
Modern DNA sequencing methods produce vast amounts of data that often requires mapping to a reference genome. Most existing programs use the number of mismatches between the read and the genome as a measure of quality. This approach is without a statistical foundation and can for some data types result in many wrongly mapped reads. Here we present a probabilistic mapping method based on positionspecific scoring matrices, which can take into account not only the quality scores of the reads but also userspecified models of evolution and dataspecific biases.
Results
We show how evolution, dataspecific biases, and sequencing errors are naturally dealt with probabilistically. Our method achieves better results than Bowtie and BWA on simulated and real ancient and PARCLIP reads, as well as on simulated reads from the AT rich organism P. falciparum, when modeling the biases of these data. For simulated Illumina reads, the method has consistently higher sensitivity for both singleend and pairedend data. We also show that our probabilistic approach can limit the problem of random matches from short reads of contamination and that it improves the mapping of real reads from one organism (D. melanogaster) to a related genome (D. simulans).
Conclusion
The presented work is an implementation of a novel approach to short read mapping where quality scores, prior mismatch probabilities and mapping qualities are handled in a statistically sound manner. The resulting implementation provides not only a tool for biologists working with low quality and/or biased sequencing data but also a demonstration of the feasibility of using a probability based alignment method on real and simulated data sets.
Keywords:
Shortread mapping; Sequence alignment; Nextgeneration sequencing; Ancient DNA; PARCLIP; Xeno mappingBackground
Nextgeneration DNA sequencing is a powerful tool in biological research [1] and is steadily gaining momentum as costs keep decreasing. Applications vary from genome resequencing [24] to transcriptome analysis [57], metagenomics projects [810], and sequencing of ancient genomes [1116]. All these applications rely on mapping reads to existing reference genomes. Many mapping programs have been developed using a variety of algorithms with different strengths, weaknesses and limitations [17].
Hashbased algorithms such as MAQ [18] and SOAP [19] dominated initially but were hampered by large memory demands. Subsequently, the BurrowsWheeler transform [20] was applied to compress the genome index in programs such as Bowtie [21], Bowtie2 [22], SOAP2 [23], and BWA [24]. This decreased the memory usage while increasing speed and sensitivity, leading to mappers based on the BurrowsWheeler transform to now dominate the field. Other approaches, however, show promising results for some types of data. Segemehl [25] uses an enhancedsuffix array to provide fast alignment of insertion/deletion (indel) prone reads, and a similar approach was implemented in the mapping tool used in the sequencing of the first ancient human genome [13]. Programs such as CUSHAW [26] and SOAP3 [27] have begun to use graphics processing units (GPUs) to provide even faster mapping.
Most programs allow a specified number of mismatches in an alignment (with a limit of 13 mismatches in the beginning of the read), and report uniquely mapped reads as those where all other locations have more mismatches. However, evaluating a mapping location by the number of nucleotide mismatches alone is not optimal and implicitly assumes that the genome has a homogeneous base composition and that errors occur uniformly in the reads. The mapping with the lowest number of mismatches may have a high probability of being incorrect if i) there are many suboptimal mappings, ii) the genome has a very biased base composition, iii) certain nucleotide mismatches are expected due to sample conditions, or iv) the mismatching bases have low errorprobabilities compared to other bases.
Most sequencing platforms provide a quality score for each base derived from the probability that the nucleotide is wrongly assigned in the basecalling. With the Illumina platform, the error probabilities typically range from around 0.01% in the 5’ end of the read to several percent in the 3’ end, but the actual DNA sequence can affect the read quality [28]. These qualities can affect not only the ability of a mapper to find the correct hit, but also the quality of the reported hit. While the latest generation of mappers such as MASAI [29] and GEM [30] either do not take quality scores into account or only consider them in a rudimentary manner, they also report all possible alignments and do not provide a mapping quality to distinguish between a high confidence alignment and a low confidence one. We demonstrate that taking quality scores and other information about the biases in the experimental data into account can improve the sensitivity while providing an accurate mapping quality estimation. Recently, the use of positionspecific scoring matrices (PSSMs) has been applied to the short read mapping problem and shown to provide accurate SNP and indel calling [17,31].
Here we show how quality scores, contamination, biases in base composition, mutations, and dataspecific basechanges (as in PARCLIP or ancient DNA) can all be dealt with probabilistically and encoded in a PSSM. We also present an algorithm which applies the BurrowsWheeler transform to the scoring of PSSMs, and BWAPSSM, a fast and sensitive mapping tool which implements the methods.
We show that the use of probabilistic scoring allows for both higher sensitivity and positive predictive value when mapping simulated reads. More importantly, we offer the ability to use a specified error model to map reads based on the expected types and locations of mismatches. This improves the mapping of ancient DNA data with errors due to damage, and the improvement is even more pronounced when mapping PARCLIP data, where there is a strong bias towards TtoC substitutions [32], as well as data from P. falciparum which has an extreme nucleotide bias with more than 80% AT content [33]. We also show that the probabilistic scoring can help in cases of contamination by dramatically reducing the mapping of short reads from E. coli to the human genome.
Results and discussion
When mapping reads, one is interested in the probability that the read originated from a specific location in the genome. The read and the genome might not be completely identical due to e.g. sequencing errors or SNPs.
A positionspecific scoring matrix from quality scores
Most sequencing machines provide a quality score for each base which is related to the probability of a sequencing error occurring at this position in the read. These qualities are normally in the Phred format [34] and relate the error probability p_{e} and the quality score Q by p_{e} = 10^{Q/10}. From these qualities, we can calculate the probability P(ax) for the base a ∈ {A,C,G,T} being present at a given position in the DNA fragment given that the base x ∈ {A,C,G,T} was called by the sequencing machine (see Methods).
If the DNA being sequenced differs from the reference genome, e.g. due to evolution, there is a probability P(ga) that base g occurs in the genome if the base is a in the sample. We use a simple model with a probability p_{0} for a mutation (see Methods).
Combining this with the probability of errors, the probability of a base g in the genome given the called base x is
In some types of data, known base modifications are known to occur. For instance, in ancient DNA some bases are damaged due to hydrolysis, resulting in cytosine to thymine (CtoT) conversions in the 5’ ends of reads, which can look like apparent guanine to adenine (GtoA) substitutions in the 3’ end, and in PARCLIP experiments there is a large fraction of thymine to cytosine (TtoC) conversions where the protein crosslinks to the RNA. These phenomena are easily modeled and included in the PSSM, essentially by introducing a probability P(ba) that the observed base is b, given that the base is a in the sample (see Methods section).
Using this approach, one can turn a short read into a PSSM and use it for mapping the read to the genome. The PSSM is constructed such that at position i in the read, the score for base g is s(gx_{i}) = log_{2}(P(gx_{i})/q(g)), where P(gx_{i}) is the probability of a genomic base g given the i’th base x_{i} in the read calculated as described above in equation 1, and q(g) is a background probability, which is usually just the frequency of the base in the genome. The score for matching a read x to a certain position ℓ in the genome is S_{x}(ℓ), and is obtained by summing the scores from the PSSM corresponding to the genome sequence starting at position l.
BWAPSSM is very flexible when it comes to how you feed the PSSM to the program. In the present implementation, the PSSM can be constructed from the quality scores supplied with the sequence, using a background distribution calculated from the frequencies of bases in the reference genome and a mutation rate (p_{0} above). Alternatively, the user can supply a table with a direct translation of base and quality score pairs to PSSM scores. Such a table can be computed ahead of time to include for instance a more sophisticated evolutionary model or a PARCLIP model. Finally the user can input a fully constructed PSSM and use it for mapping.
Mapping probability
The advantage of the PSSM is that the probability of a match can be calculated directly [17]. There is a probability that a read x originates from – or is homologous to – a position ℓ in a genome g, P(ℓ,Mx,g), where M means the "match model". Alternatively, the background model N is used if the read does not originate from the genome due to e.g. contamination or adapter sequences. A priori, we may be able to estimate the probability P(Nx) of a read being contamination, and obviously P(Mx) = 1  P(Nx).
Using the sum and product rule we can express the match probability at position ℓ as
Assuming that any mapping position is equally likely a priori we have P(ℓM,x) = 1/L, where L is the length of the genome.
In the background model N we assume independently identically distributed (i.i.d.) bases.
In the match model M, the probability of the aligned bases in g is calculated from the PSSM and the remaining bases are assumed to be distributed as in the background model; this means that
, where S_{x}(ℓ) is the PSSM score. It is possible, of course, to have more sophisticated background models of those sequences that do not originate from the genome, such as a Markov model, but this will not be considered here.
Using the identity , where ℓ^{′} runs over all positions in the genome, we can finally write the posterior probability of the read x mapping to position ℓ in the genome as
The first term in the denominator is the sum over all possible genomic positions. This is impractical to calculate, but since only positions with some similarity to the query yield a significant contribution, it is well approximated by a sum over highscoring mappings. The second term in the denominator is proportional to the genome size and reflects the fact that the larger the genome, the more likely it is to have a random hit.
In the BWAPSSM implementation we assume the same prior match probability for all reads, P(M) = P(Mx), which can be specified as a parameter with a default value of 0.8 which is used for all experiments in this paper. To simplify the presentation we assumed no indels in the alignment of the read and the genome, see Additional file 1 for the derivation of equation (2) and Methods for details on handling indels.
Additional file 1. Supplementary material.
Format: PDF Size: 1.2MB Download file
This file can be viewed with: Adobe Acrobat Reader
Algorithm and implementation
PSSM search is implemented in the BWA program [24] as a separate version called BWAPSSM, which can be used instead of the regular BWA alignment step (aln). Here we describe the main algorithmic changes as compared to standard BWA.
Using a PSSM, a score can be calculated for each position in the reference genome, where high scores indicate hits. Using a BurrowsWheeler transformed index, this operation can be sped up by evaluating scores on the prefix tree of the index. At any given point in the search, the position within the scoring matrix corresponds to the current depth of the prefix tree. Scores are calculated by summing the positionspecific score at each node along a path.
To bound the number of positions in the genome that must be evaluated, a threshold score is used which replaces the maximum number of mismatches used in standard BWA. To calculate an overall score threshold for a read, we rely on converting a limit on the number of mismatches n into the minimum possible score with n mismatches. That is, we allow n mismatches of bases with high quality and more than n mismatches of lowquality bases.
To further prune the search space, lookahead scoring can be used, in which the threshold is calculated for each position in the PSSM as the difference between the threshold and the best possible score of the remaining PSSM [35]. This is implemented in BWAPSSM (Algorithm 1), but using an improved bound. This is done by adapting the method (named CALCULATED) employed by BWA to consider what the minimum number of mismatches must be for that subsequence to align to some region of the genome. Our algorithm, called CALCULATET (Algorithm 2), instead calculates the difference between the best possible PSSM score with no mismatches and the best possible score with the minimum number of necessary mismatches. This difference is calculated at each position and added to the lookaheadderived intermediate thresholds. This has the effect of requiring a higher match score at each position and thus further bounding the search tree. This allows for faster and more accurate mapping of sequences with many low quality bases as more likely paths (and thus matches) will tend to be visited first.
Algorithm 1 The recursive BWAPSSM search algorithm. The PSSM is denoted A and the PSSM thresholds at each positioni are stored in T[i]. A score, s, is maintained for every partial match and the indices into the BurrowsWheeler transformed sequence are stored in k and l. In the algorithm O(b,l) denotes the number of occurrences of the base b in the l’th prefix of the BurrowsWheeler transformed reference sequence, and C(b) is the number of occurrences of bases that are lexicographically smaller than b in the reference sequence. Insertions and deletions are assigned penalties ρ_{i} and ρ_{d}, respectively. The algorithm is initiated with the values PSSMSEARCH(A,T,x1,0,g1), whereT is calculated from the sequence x using the algorithm CALCULATET and g denotes the size of the reference sequence.
Algorithm 2 Calculation of intermediate thresholds. The algorithm calculates intermediate thresholds for PSSM A, read sequence x, genomic reference sequence g given the global threshold t. x_{j:i} is the subsequence from j to i of the read, and t stores the difference between the best and second best PSSM score that can be obtained for the subsequence. The MINDROP(A,i,j) function simply calculates the minimum of the differences between the highest and lowest scores for columns i to j in the PSSM, while the function SUMMAX(A,0,i  1) calculates the sum of the maximal values in column 0 to column i  1 in the PSSM.
While BWA uses the MAQ mapping quality score (MapQ), BWAPSSM reports the posterior probability given in equation (2), but estimating the sum in the denominator from the matches above the threshold. In keeping with the MAQ tradition, this probability is also log scaled, rounded to an integer and reported as the MapQ score in the output, that is .
Comparing methods on simulated reads
The performance of BWAPSSM was compared to BWA [24], BWAMEM [36], Bowtie [21], Bowtie2 [22] and GEM [30] on a number of simulated datasets modelling different types of short reads. The results are summarized in Tables 1, 2 and 3 and Figures 1 and 2, and details are given below. Each program, except Bowtie, reports a mapping quality (MapQ), and the mapping performance clearly depends on which threshold is used for this. We report the unfiltered results, for which we use no MapQ threshold, and the filtered results, for which all matches with a MapQ of less than 25 are discarded. The sensitivity reported is the number of correctly mapped reads divided by the total number of reads. The PPV (Positive Predictive Value) is the number of correctly mapped reads divided by the number of reported matches. The elapsed user time is reported when running each program on an Intel(R) Xeon(R) E7450 2.40GHz CPU.
Table 1. Analysis of singleend data simulated with ART
Table 2. Analysis of pairedend data simulated with ART
Table 3. Analysis of simulated biased singleend data
Figure 1. Comparison of the sensitivity of BWAPSSM, BWA, BWAMEM and Bowtie2 after applying MapQ filtering on the singleend and pairedend data sets simulated with ART. After filtering the results on mapping quality all the mappers shown above have a PPV above 0.99. The sensitivities and PPVs are listed in Tables 1 and 2. Bowtie and GEM are excluded as they do not provide MapQ scores.
Figure 2. Comparison of the sensitivity of BWAPSSM, BWA, BWAMEM and Bowtie2 for the three simulated biased data sets after applying MapQ filtering. After filtering the results on mapping quality all the mappers shown above have a PPV above 0.97, except for BWAMEM which has a PPV of 0.868 on the P. falciparum data set. For the PARCLIP data we show the sensitivity of BWAPSSM^{PC} using the TtoC transition model, and for the ancient DNA data the sensitivity of BWAPSSM^{A} using the ancient DNA damage model is shown. The sensitivities and PPVs are listed in Table 3. Bowtie and GEM are excluded as they do not provide MapQ scores.
Unbiased reads
To test the baseline performance of BWAPSSM, we simulated reads using three different simulation programs. The first, ART [37] simulates reads using an error profile particular to the sequencing technology being simulated (Illumina Genome Analyzer II, in our case). The second, WGSIM [38], simulates reads with a uniform error probability. The third, MASON [39], uses variable, position dependent qualities drawn from normal distributions with a specified mean and standard deviation for the start and end position. ART generated the lowest quality reads, followed by MASON, followed by the high quality WGSIM simulated reads.
As expected, BWAPSSM performs best on the lowquality data set (Table 1) and slightly worse on the high quality reads (Additional file 1: Table S2 and S4). It performs comparatively better on the shorter reads than on the longer. This is likely explained by the fact that we limit the size of the heap (and consequently the branching) and thus miss more hits simply because we discard a proportionally larger portion of the search space for the longer reads. Filtering according to mapping quality improves the PPV and reduces the sensitivity.
The performance of BWAPSSM really stands out when considering high quality alignments (as reported by the MapQ score) of ARTsimulated reads (Figure 1). BWAPSSM achieves the greatest sensitivity of the tested aligners which all have a PPV greater than 99%. For reads of length 36 and 50, BWAPSSM performs better across all quality values, whereas for the longer reads BWAMEM reports more lower quality mappings (Figure 3). When ignoring the quality of the resulting mappings, Bowtie returns more results with the caveat that they are more likely to be incorrect. The running time among the aligners varies within an order of magnitude with Bowtie consistently being the fastest while the slowest depended on the length and quality of the reads.
Figure 3. Sensitivity as a function of PPV for BWAPSSM, BWA, BWAMEM and Bowtie2 using singleend ARTsimulated data. For short reads of length 36 (a) and 50 (b) BWAPSSM shows greater sensitivity than the other mappers at similar PPV. For reads of length 76 (c) the performance of BWAPSSM and BWAMEM is similar, while for reads of length 100 (d) BWAMEM has slightly higher sensitivity than BWAPSSM at similar PPV. The curves for each mapping program were obtained by filtering for varying mapping qualities. The results are based on the simulations shown in Table 1. Bowtie and GEM are excluded as they do not provide MapQ scores.
For the longer higher quality reads, the performance of BWAPSSM lags slightly behind the other aligners for all PPV values (Additional file 1: Figure S2 and S3). These results are not unexpected as using quality values which are largely irrelevant should not improve the performance. Furthermore, some of the tradeoffs employed to improve the performance for lowquality and biased reads impede the performance for high quality data. While disadvantageous, this is not the targeted type of data for which this approach was designed and the other aligners already serve this niche adequately.
GEM reports all the potential hits found and classifies them according to the edit distance from the genome. While this approach leads to a greater overall sensitivity, it also makes it difficult to assign a confidence value to a particular alignment. The strength of this mapper lies in aligning long insertion/deletion prone reads, two qualities which are conspicuously absent from the presented benchmark data sets. As such, the performance of GEM is presented in the data tables simply as a point of a comparison for this different class of read mappers.
For pairedend data (Table 2 and Additional file 1: Table S3), the situation is similar but more pronounced. Low quality reads are readily aligned with high accuracy by BWAPSSM whereas high quality reads present a greater challenge. Due to the use of (nearly) default parameters for each aligner, BWA performs extremely poorly on the longer low quality reads. The situation is somewhat reversed for the high quality reads where BWAPSSM finds slightly less hits in a longer amount of time than Bowtie2 and BWA. The results presented, of course, depend greatly upon the parameters chosen for each mapper (the default). Exploring the potential parameter space for each program is an overwhelming task which is often guided by the data to be aligned. The results presented here are merely meant to be a cross section of the potential capabilities of each aligner, corresponding to a roughly comparable (within an order of magnitude) running time.
Ancient DNA
By specifying a probability for each base at each position, it is possible to include additional information in the alignment. Ancient DNA is fragmented and degraded in various ways, leading to specific biases and errors in the data. The dominant error is the deamination of cytosines into uracils (CtoU), which will be interpreted as thymines in the sequencing step [40]. This leads to an excess of CtoT or GtoA mismatches, depending on the strand being sequenced. This is most significant in the ends of the reads and decreases rapidly towards the center [15].
PSSMs were simulated using the damage model specified by Orlando et al.[15], see Methods. The results (Table 3b) show that BWAPSSM with a PSSM modelling the simulated damage gives slightly higher sensitivity and PPV than without a damage model. The sensitivity is slightly higher than BWA while the run time is roughly the same. BWAPSSM was able to find more hits than BWA, Bowtie and Bowtie2 even without a specialized PSSM.
When mapping real ancient DNA and filtering on MapQ, the results mostly reflect the simulated data (Table 4a). The use of a PSSM led to the recovery of more matches than without one. While the difference is not large, the number of reads one might expect to be damaged is rather low in comparison to the total number of reads present. Hence, a modest increase in actual numbers can reflect a greater increase in relative terms. Furthermore, if the results reflect the simulated data, then the expected PPV of the filtered results should be higher than for the other mapping tools. Again, BWAPSSM without a specialized PSSM provides an increase in filtered matches compared to BWA and Bowtie2.
Table 4. Analysis of real singleend data
PARCLIP data
Sequencing data from PARCLIP experiments is very prone to TtoC transitions due to the incorporation of 4SUcontaining nucleobases and their crosslinking to the bound protein. The locations of such transitions indicate where an RNA molecule is bound by a protein [41]. The increased probability of a TtoC mismatch is readily encoded in a PSSM and incorporated into the mapping by BWAPSSM.
To examine the efficacy of providing this extra information, reads were simulated with a 11% TtoC rate. The results (Table 3a) show that the use of a TtoC transition model improves both the unfiltered and filtered sensitivity. Without such a model, BWAPSSM performs only slightly better than BWA. Introducing the error model increases the filtered sensitivity by nearly 14% over the previous best aligner (BWA) for qualityfiltered mappings. This advantage, however, is not limited to high quality mappings and can be seen across all reported mapping qualities (Figure 4). Even for the high quality data sets, where the use of PSSM was more of a hindrance for aligning unbiased reads, introducing an error model leads to greatly improved sensitivity in aligning simulated PARCLIP data (Additional file 1: Table S2 and S4).
Figure 4. Sensitivity as a function of PPV for BWAPSSM, BWA, BWAMEM and Bowtie2 using singleend PARCLIP data simulated using ART. The curves for each mapping program were obtained by filtering for varying mapping qualities. The top line for BWAPSSM (PC) includes the PARCLIP model. The results are based on the PARCLIP simulation shown in Table 3a. Bowtie and GEM are excluded as they do not provide MapQ scores.
To support the simulated data, real data was obtained from a PARCLIP study investigating the binding of RNA to the HuR protein [42]. The statistics for the mapping of the real data (Table 4b) roughly reflects those for the simulated data. Notice the greatly reduced number of matches after filtering for all the mappers, which is the result of shorter reads leading to more ambiguous matches, and a base distribution unlike that of the actual genome (due to the ATrich regions that were being sequenced). Nevertheless, BWAPSSM finds more matches than any of the other programs at a PPV value greater than 0.95 for all types of simulated data.
AT rich data
To test the effect of the background model in BWAPSSM, we simulated Illumina reads of length 100nt using ART [43] from the P. falciparum genome [33], which has an AT content of more than 80%. In BWAPSSM, the PSSMs were constructed using a background model, q(g), that matches the base composition of the reference genome, while the other mappers do not consider this base composition. We see that all mappers except BWAMEM have similar PPVs for both unfiltered and filtered reads (Table 3c). Among the mappers that provide a quality score, BWAPSSM obtains the highest sensitivity. For filtered reads BWAMEM has the next highest sensitivity, however the PPV for BWAMEM is significantly lower than for the other mappers. Bowtie2 has a marginally lower unfiltered sensitivity than BWAPSSM, however, for the filtered reads the sensitivity of Bowtie2 is notably lower. While BWA has a considerably lower sensitivity than the other mappers, Bowtie has the highest sensitivity, which is obtained at the cost of slightly reduced PPV compared to the filtered PPV of BWAPSSM.
Random matches from contaminating reads
We examined how well the mappers can filter out random short matches. To obtain random matches, we simulated short reads of different lengths from the E. coli genome [44] using ART [43] and mapped them to the human genome. From Figure 5 we observe that BWAPSSM, BWA and Bowtie map similar fractions of reads when not applying a quality filter, while this fraction in general is smaller for Bowtie2. However, for quality filtered reads BWAPSSM maps at worst less than 2% of the reads, which is less than any of the other mappers except BWAMEM, though from read length 20nt the curves for Bowtie2 and BWAPSSM are coincident. In comparison, BWA maps more than 25% of the reads at worst after quality filtering. GEM maps approximately the same fraction of reads as Bowtie2 does after quality filtering and BWAMEM cannot map any reads at the given lengths. In conclusion, we see that BWAPSSM has similar efficiency as BWA and Bowtie in mapping short reads, however the quality score (posterior probability) reported by BWAPSSM allows for filtering out random matches at least as efficiently as Bowtie2.
Figure 5. Mapping simulated E. coli reads to the human genome. The fraction of all E. coli reads that are mapped is shown as a function of the read length. For each length, 100,000 Illumina reads were simulated using ART [43]. The dotted lines represent the fraction of unfiltered reads, while the full lines are the fraction of quality filtered reads. For Bowtie and GEM curves are only shown for unfiltered reads. The curves for BWAMEM are hardly visible as BWAMEM does not map any reads.
Xeno mapping
If no reference genome is available for a set of reads, one can try to map the reads to a closely related species. This task is called cross species mapping or xeno mapping [31]. Inspired by the work of Frith et al.[31], we examine how well the programs can map real reads from D. melanogaster using the closely related D. simulans genome as reference. We consider three sets of D. melanogaster reads obtained from the NCBI SRA database: a set of short reads (length 36nt, ID SRR001981), a set of long reads with low quality (length 76, ID SRR023647) and a set of long reads with high quality (length 95, ID SRR516029). All data sets are based on the Illumina platform and low quality bases in the end of the reads are trimmed using the AdapterRemoval tool [45].
To gauge the performance of the mappers, reads are mapped to both the D. simulans and D. melanogaster genome using each mapper. For each tool, we consider all reads that map to the genomes with a mapping quality above a given threshold, and we compare the mapping positions in the two genomes based on the UCSC whole genome alignment of D. simulans and D. melanogaster. If the mapping regions overlap in the whole genome alignment we say the read is consistently mapped. If the read maps to a region in D. melanogaster that is aligned to D. simulans, but the mapping regions do not overlap, we say the read is inconsistently mapped. For the xeno mapping experiments we calculate the sensitivity as the number of consistently mapped reads divided by the total number of reads, and the PPV is calculated as the number of consistently mapped reads divided by the number of consistently and inconsistently mapped reads.
The flexibility of BWAPSSM allows us to include an evolutionary model that reflects the substitution level between the two Drosophila genomes when we map the reads to D. simulans (see Methods). For comparison we also map the reads to D. simulans with BWAPSSM using only the standard error model. In both cases, we map the reads to D. melanogaster using only the standard error model.
Due to the setup of this experiment, we only consider mappers that provide MapQ scores. The results are shown in Table 5 and Figure 6. In Table 5 we used a MapQ threshold of 25 and in Figure 6 the threshold was varied between 0 and 200.
Table 5. Analysis of the xeno mapping experiment
Figure 6. Sensitivity as a function of PPV for BWAPSSM, BWA, BWAMEM and Bowtie2 for the xeno mapping experiment. Singleend Illumina reads from D. melanogaster are mapped to the D. simulans genome using the five mappers. Plots are shown for short reads (a), long low quality reads (b) and long high quality reads (c). The curves for each mapping program were obtained by filtering for varying mapping qualities. The results are based on the data in Table 5. Bowtie and GEM are excluded as they do not provide MapQ scores.
We see that BWAPSSM with the evolutionary model generally performs better than all the other programs for the short reads and the long low quality reads (Figure 6a and 6b). BWAPSSM has higher sensitivity for any PPV value except in the high PPV end of the curves. For the short reads, BWAMEM has slightly better sensitivity at PPV between 0.991 and 0.992, and for the long low quality reads BWA and Bowtie2 can obtain slightly higher PPV than BWAPSSM in the low sensitive end of the curves. For the long high quality reads (Figure 6c) we see that BWAMEM performs best overall, while Bowtie2 can obtain marginally higher PPV than BWAMEM. BWAPSSM can obtain the next highest sensitivity with the evolutionary model, however all the other mappers, including BWAPSSM without the evolutionary model, can obtain higher PPVs.
We see a similar picture when we consider the filtered results in Table 5. While the mappers have similar PPVs on all the data sets for a fixed MapQ threshold, BWAPSSM has the highest sensitivity on the short and long low quality reads. On the long high quality reads BWAMEM has the highest sensitivity and BWAPSSM the next highest. We also see that the increased sensitivity of BWAPSSM come at the cost of increased running time. In the worst case (Table 5b) BWAPSSM is nearly 7.5 times slower than the fastest method (Bowtie2), however in this case BWAPSSM maps over 40% more reads and have a marginal higher PPV than Bowtie2.
Finally we also see that the performance of BWAPSSM generally improves when including an evolutionary model. On the short reads the sensitivity of BWAPSSM is improved at all PPVs and on the two long read data sets BWAPSSM can generally obtain higher sensitivity with the evolutionary model. However, BWAPSSM can obtain slightly higher PPVs using only the standard error model on the long reads.
Conclusions
We have presented BWAPSSM, a novel method for using positionspecific scoring matrices to provide quality aware short read mapping. The algorithm for PSSM scoring is based on BWA’s mapping algorithm. This method can be applied to other PSSM applications, lowering the number of genomic locations that need to be evaluated and increasing the efficiency of PSSM searches.
There are many advantages in the probabilistic approach taken here, in which the probability of a match being correct is estimated using prior probabilities that may be specific for the experiment. We have shown how it is possible to model the evolutionary differences between the sample and reference genome, sequencing errors based on e.g. quality scores, experimentspecific base substitutions, and contamination in the sample.
It is worth emphasizing the importance of the prior probability of a match. In all experiments there is a possibility that the read sequence is due to contamination or that it does not appear in the reference genome for other reasons (e.g. sequences private to the sampled individual). In some experiments the majority of reads may be contamination, as is often the case with ancient samples. Especially when it comes to short reads, there is a great risk that contaminating sequences will be mapped to the reference genome, as was illustrated with the mapping of E. coli sequences to the human genome. This may be taken into account by changing the value for the prior probability of a match.
Some of the important extensions implemented in BWAPSSM include the ability to use longer genomes and to handle the forward and reverse search in a single concatenated index, the control of the direction of the search tree traversal based on the underlying PSSM, and the use of an interval heap for focusing on the most likely region of the search tree while discarding lower scoring branches. Furthermore, BWAPSSM can capture platform and sample specific biases in the nucleotide composition, making the tool highly sensitive and adaptable to specialized applications such as ancient DNA, PARCLIP, or biased genomes.
Because matches can be prioritized by a continuous match score, the number of candidate matches stored in the heap can be significantly reduced compared to other methods that use the number of mismatches to prioritize. For this reason the PSSM implementation runs with a speed comparable to the other mappers.
Methods
A probabilistic model for nextgeneration DNA reads
In this section we present a detailed description of the probabilistic model of DNA reads. In the simple model we have two distinct processes that can affect the base called by the sequencer compared to the base actually present in the genome of interest: Actual mutation with respect to the reference genome from base g to base a, and a miscalled base x by the sequencing machine given base a in the sample.
We will assume that the observed base x is independent of the base g in the genome given the base a in the sample. This conditional independence assumption can be expressed by the Bayesian network in Figure 7A. Using this assumption we can calculate the probability of observing the base x in the read given the base g in the genome , where P(ag) is the mutation model and P(xa) is the model of sequencing errors. To calculate the entries in the positionspecific scoring matrix (PSSM) we need P(gx), and the conditional independence assumptions also give that
Figure 7. Bayesian networks describing the independence assumptions between nucleotides. The nodes represent random variables, while the arrows describe the conditional independencies between the variables. The left figure (A) shows the base in the reference genome g, the base in the sample a, and the observed base x, while the right figure (B) also includes the alternated (biased) base b.
P(ga) can be calculated from an evolutionary model. Here we use a simple one, where there is a fixed probability p_{0} for a mutation which is independent of the base, that is
For this to make sense, we implicitly assume a uniform base distribution in the genome.
For sequencing error we also assume a simple model, where the probability of an error p_{e} = 10^{Q/10} is given by the quality score Q. From this we can calculate the probability of the sample containing base a given that the sequencer is calling base x
Note that it is in principle straightforward to use more sophisticated models both for evolution and for sequencing errors and include these in the PSSM. An example of such a model is presented in the next section.
Evolutionary model for xeno mapping
For xeno mapping we adopt the evolutionary model by Frith et al.[31]. According to this model we write
where p_{s} is the probability of a substitution and p_{t} is the probability of a transition given that a substitution has happened. For mapping D. melanogaster reads to the D. simulans genome, we used the substitution and transition frequencies observed in the UCSC alignment of the two genomes as given by Frith et al.[31], that is and .
A probabilistic model for biased DNA reads
In a given data set, we might expect specific biases to affect the frequency with which particular modifications of the DNA sequence occur. This is a wellknown phenomenon in many different types of data. If we include a specific model of these biases, we now have three processes that can affect the base called by the sequencer, where the additional process is changing the original base a into base b due to the type of DNA sample. The independence assumptions for these three processes is expressed by the Bayesian network in Figure 7B.
Using these assumptions the probability of genomic base g given that base x is reported by the sequencing machine is (as before)
where the term P(ab) describes the bias. However, in the examples we will describe here, it is more natural to estimate P(ba). By performing the asum in equation (6) and using the conditional independence assumptions, we obtain the expression
Using Bayes’ rule and the sum rule we can write the first term in the equation above as
Assuming the simple models for evolution P(ag) and sequencing error P(bx) from the previous section, all we need in order to calculate P(gx) is an expression for the bias model P(ba) and trivially a prior for the genomic base distribution P(g). In the following sections we will consider two such bias models.
Ancient DNA model
In ancient DNA we often observe damage (miscoding lesions), especially CtoU changes due to deamination, which translate into CtoT in the sequenced reads (or GtoA depending on the strand being sequenced). The damage model P(ba) describing these alterations is as follows. We allow CtoT and GtoA changes and the probabilities of these changes depend on the position in the read, with 3’ and 5’ positions having the highest rate of damage. Let i denote the position in the read, then we specify the probabilities p_{i}(ba) = 0 except for the following cases:
The actual damage rates, γ_{i} and δ_{i}, are obtained from Orlando et al.[15].
For ancient DNA, the evolutionary model P(ag) is obtained by assuming a probability m_{1} for transitions (substitutions between C and T, or between A and G) and a smaller probability m_{2} for transversions (all other substitutions).
PAR CLIP model
While ancient DNA has a pattern of alterations where it is mainly bases in the 5’ and 3’ ends of a read that are modified, other types of data exhibit different modifications. PARCLIP data, for example, has consistent TtoC changes. These changes can be handled in the same manner as for ancient DNA damage described above by modifying the conditional probabilities P(ba) to construct a model P(xg) incorporating the particular modification pattern.
Practical calculation of the PSSM
Once the evolutionary model, the bias model, and the error model have been formulated, P(gx), and thus the corresponding PSSM scores, can be calculated for all possible combinations of base and quality score (and read position for the damage model). Therefore the conversion of a read with quality scores to a PSSM can be done very fast with a table lookup.
Including indels in the alignment
In equation (2) we only considered alignments without indels. If we assume that the indel pattern a is independent of the starting position ℓ we can write the posterior probability for a match as
Again we will assume that all unaligned positions in the genome are i.i.d. with the same distribution in both the match and mismatch model, which means that . Consequently we can write the posterior probability as
where the sum in the denominator runs over all positions ℓ^{′} in the genome and all possible alignments a^{′} starting at position ℓ^{′}.
For the indel pattern we will use a linear gap penalty model, which means that the gap lengths follow a geometric distribution [46]. If we assume that insertions and deletions happen independently at each position in x, we can write the probability of a given alignment in terms of the total number of insertions a_{i} and deletions a_{d} in the sequence
where is the probability of an insertion and is the probability of a deletion. Here ρ_{i} and ρ_{d} are the insertion and deletion penalty log scores, respectively. In practice this is implemented as illustrated in Algorithm 1 and the default value for ρ_{i} and ρ_{d} is 17.
Other BWA modifications
In addition to allowing the use of positionspecific scoring matrices for alignment, we introduced two other modifications to speed up the alignment method and to allow for the indexing of larger genomes.
64Bit indexing
The existing version of BWA has been modified to use 64bit indexing by changing the underlying data structures. This modification allows for the sequencing of genomes greater than 4 billion base pairs in length. Due to the limitations of the SAM/BAM format [47], however, these genomes must be composed of subsequences (i.e. chromosomes) smaller than 4 gigabases each. Using 64bit indexing leads to an index approximately 32% larger than its 32bit counterpart.
Single index
Traditionally, BWA uses two indexes: One of the forward and one of the reverse of the reference genome. Searches are performed by using the original read and a complemented read and searching the forward index and reverse index with the corresponding read. Combining the forward and reverse indexes into a single structure and searching using only the original read improves the running time by consolidating prefixes common to both the forward and reverse indexes (see Figure 8). Inspired by our work, a single 64bit index has also been included in the 0.6 release of BWA.
Figure 8. Comparison of the original BWA alignment method (top) and the modified version using a single index (bottom). The sequence of the short read, ‘TAC’, on the ‘’ strand can easily be found by subtracting the length of the original sequence from its position in the concatenated sequence.
Heap sorting for partial hits
In BWA, the best partial candidate hits are kept sorted by a score computed as a function of the mismatches, insertions and deletions in a heaplike structure. BWAPSSM, in contrast, uses the offset from the best possible score at a particular position as the sorting criterion (Additional file 1: Figure S1). If the score of an alignment candidate (partialhit) drops due to mismatches or insertions/deletions, it moves down the list. While BWA abandons the search when the size of the heap is exceeded, BWAPSSM continues searching while discarding the partial hits with the lowest PSSM score offset. This is made possible by the use of an intervalheap data structure, allowing for the rapid removal, Θ(log n), of either the largest or smallest element [48].
The traditional breadthfirst search of the prefix tree employed by BWA is further modified such that each branch is weighted according to the decrease from the optimal PSSM score for that character. That is, the branch corresponding to a perfectly matched base would have a weight corresponding to 0 since it will contain the best possible score (see Additional file 1: Figure S1). A mismatch branch, however, would have a lower score due to the use of a suboptimal PSSM score. The use of score offsets corresponds to a breadthfirst traversal of the suffix tree whereas using the total PSSM score itself corresponds to a depthfirst traversal.
Availability
BWAPSSM and its source code are freely available through the website http://bwapssm.binf.ku.dk webcite.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
PK and AK developed the method and algorithms. JF and SL contributed to discussions on the method and algorithms. PK developed the software. PK, JF, SL, and AK designed the experiments and analysis. PK and JF performed the experiments and analysis. PK, JF, SL and AK wrote the manuscript. All authors read and approved the final manuscript.
Acknowledgements
The authors would like thank Simon Rasmussen and Mireya Plass for providing the error model used for the PARCLIP data as well as valuable feedback and bug reports. This work was supported by the Novo Nordisk Foundation, the Danish National Research Foundation, and the Danish Council for Strategic Research. PK would like to thank Novo Nordisk and Novozymes for funding via the Novo Scholarship Program. JF is supported by an individual postdoc grant from the Danish Council for Independent Research  Natural Sciences. SL is supported by a Marie Curie International Outgoing Fellowship within the 7th European Community Framework Programme.
References

Metzker ML: Sequencing technologies  the next generation.
Nat Rev Genet 2010, 11(1):3146.
doi:10.1038/nrg2626
PubMed Abstract  Publisher Full Text 
Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, Yang Z, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, Qin J, et al.: The diploid genome sequence of an Asian individual.
Nature 2008, 456:6065. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song XZ, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM: The complete genome of an individual by massively parallel DNA sequencing.
Nature 2008, 452(7189):872876. PubMed Abstract  Publisher Full Text

Kitzman JO, Mackenzie AP, Adey A, Hiatt JB, Patwardhan RP, Sudmant PH, Ng SB, Alkan C, Qiu R, Eichler EE, Shendure J: Haplotyperesolved genome sequencing of a Gujarati Indian individual.
Nat Biotechnol 2011, 29(1):5963. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ: Deep surveying of alternative splicing complexity in the human transcriptome by highthroughput sequencing.
Nat Genet 2008, 40(12):14131415. PubMed Abstract  Publisher Full Text

Cloonan N, Grimmond SM: Transcriptome content and dynamics at singlenucleotide resolution.
Genome Biol 2008, 9(9):234. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Levin JZ, Berger MF, Adiconis X, Rogov P, Melnikov A, Fennell T, Nusbaum C, Garraway LA, Gnirke A: Targeted nextgeneration sequencing of a cancer transcriptome enhances detection of sequence variants and novel fusion transcripts.
Genome Biol 2009, 10(10):115. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Adams IP, Glover RH, Monger WA, Mumford R, Jackeviciene E, Navalinskiene M, Samuitiene M, Boonham N: Nextgeneration sequencing and metagenomic analysis: a universal diagnostic tool in plant virology.
Mol Plant Pathol 2009, 10(4):537545. PubMed Abstract  Publisher Full Text

Petrosino JF, Highlander S, Luna RA, Gibbs RA, Versalovic J: Metagenomic pyrosequencing and microbial identification.
Clin Chem 2009, 55(5):856866. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Simon C, Wiezer A, Strittmatter AW, Daniel R: Phylogenetic diversity and metabolic potential revealed in a glacier ice metagenome.

Green RE, Krause J, Ptak SE, Briggs AW, Ronan MT, Simons JF, Du L, Egholm M, Rothberg JM, Paunovic M, Paabo S: Analysis of one million base pairs of Neanderthal DNA.
Nature 2006, 444:330336. PubMed Abstract  Publisher Full Text

Miller W, Drautz DI, Ratan A, Pusey B, Qi J, Lesk AM, Tomsho LP, Packard MD, Zhao F, Sher A, Tikhonov A, Raney B, Patterson N, LindbladToh K, Lander ES, Knight JR, Irzyk GP, Fredrikson KM, Harkins TT, Sheridan S, Pringle T, Schuster SC: Sequencing the nuclear genome of the extinct woolly mammoth.
Nature 2008, 456:387390. PubMed Abstract  Publisher Full Text

Rasmussen M, Li Y, Lindgreen S, Pedersen JS, Albrechtsen A, Moltke I, Metspalu M, Metspalu E, Kivisild T, Gupta R, Bertalan M, Nielsen K, Gilbert MT, Wang Y, Raghavan M, Campos PF, Kamp HM, Wilson AS, Gledhill A, Tridico S, Bunce M, Lorenzen ED, Binladen J, Guo X, Zhao J, Zhang X, Zhang H, Li Z, Chen M, Orlando L, et al.: Ancient human genome sequence of an extinct PalaeoEskimo.
Nature 2010, 463:757762. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Rasmussen M, Guo X, Wang Y, Lohmueller KE, Rasmussen S, Albrechtsen A, Skotte L, Lindgreen S, Metspalu M, Jombart T, Kivisild T, Zhai W, Eriksson A, Manica A, Orlando L, De La Vega FM, Tridico S, Metspalu E, Nielsen K, AvilaArcos MC, MorenoMayar JV, Muller C, Dortch J, Gilbert MT, Lund O, Wesolowska A, Karmin M, Weinert LA, Wang B, Li J, et al.: An Aboriginal Australian genome reveals separate human dispersals into Asia.
Science 2011, 334:9498. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Orlando L, Ginolhac A, Raghavan M, Vilstrup J, Rasmussen M, Magnussen K, Steinmann KE, Kapranov P, Thompson JF, Zazula G, Froese D, Moltke I, Shapiro B, Hofreiter M, AlRasheid KAS, Gilbert MTP, Willerslev E: True singlemolecule DNA sequencing of a pleistocene horse bone.
Genome Res 2011, 21(10):17051719.
doi:10.1101/gr.122747.111
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Schubert M, Ginolhac A, Lindgreen S, Thompson J, ALRasheid K, Willerslev E, Krogh A, Orlando L: Improving ancient dna read mapping against modern reference genomes.
BMC Genomics 2012, 13(1):178.
doi:10.1186/1471216413178
PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text 
Hamada M, Wijaya E, Frith MC, Asai K: Probabilistic alignments with quality scores: an application to shortread mapping toward accurate SNP/indel detection.
Bioinformatics 2011.
doi:10.1093/bioinformatics/btr537

Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores.
Genome Res 2008, 18(11):18511858.
doi:10.1101/gr.078212.108
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program.
Bioinformatics 2008, 24(5):713714.
doi:10.1093/bioinformatics/btn025
PubMed Abstract  Publisher Full Text 
Burrows M, Wheeler D: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation. 1994

Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memoryefficient alignment of short DNA sequences to the human genome.
Genome Biol 2009, 10(3):25.
doi:10.1186/gb2009103r25
BioMed Central Full Text 
Langmead B, Salzberg SL: Fast gappedread alignment with Bowtie 2.
Nat Methods 2012, 9(4):357359.
doi:10.1038/nmeth.1923
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Li R, Yu C, Li Y, Lam TWW, Yiu SMM, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment.
Bioinformatics (Oxford England) 2009, 25(15):19661967.
doi:10.1093/bioinformatics/btp336
Publisher Full Text 
Li H, Durbin R: Fast and accurate short read alignment with BurrowsWheeler transform.
Bioinformatics (Oxford, England) 2009, 25(14):17541760.
doi:10.1093/bioinformatics/btp324
Publisher Full Text 
Hoffmann S, Otto C, Kurtz S, Sharma CM, Khaitovich P, Vogel J, Stadler PF, Hackermüller J: Fast mapping of short sequences with mismatches, insertions and deletions using index structures.
PLoS Comput Biol 5(9):1000502.
doi:10.1371/journal.pcbi.1000502

Liu Y, Schmidt B, Maskell DL: CUSHAW: a CUDA compatible short read aligner to large genomes based on the burrows–wheeler transform.
Bioinformatics 2012, 28(14):18301837.
doi:10.1093/bioinformatics/bts276
PubMed Abstract  Publisher Full Text 
Liu CM, Wong T, Wu E, Luo R, Yiu SM, Li Y, Wang B, Yu C, Chu X, Zhao K, Li R, Lam TW: SOAP3: Ultrafast GPUbased parallel alignment tool for short reads.
Bioinformatics 2012.
doi:10.1093/bioinformatics/bts061

Nakamura K, Oshima T, Morimoto T, Ikeda S, Yoshikawa H, Shiwa Y, Ishikawa S, Linak MC, Hirai A, Takahashi H, AltafUlAmin M, Ogasawara N, Kanaya S: Sequencespecific error profile of Illumina sequencers.
Nucleic Acids Res 2011, 39(13):90.
doi:10.1093/nar/gkr344
Publisher Full Text 
Siragusa E, Weese D, Reinert K: Fast and accurate read mapping with approximate seeds and multiple backtracking.
Nucleic Acids Res 2013, 41(7):7878. Publisher Full Text

MarcoSola S, Sammeth M, Guigo R, Ribeca P: The GEM mapper: fast, accurate and versatile alignment by filtration.
Nat Meth 2012, 9(12):11851188.
doi:10.1038/nmeth.2221
Publisher Full Text 
Frith MC, Wan R, Horton P: Incorporating sequence quality data into alignment improves DNA read mapping.
Nucleic Acids Res 2010, 38(7):100. Publisher Full Text

Kishore S, Jaskiewicz L, Burger L, Hausser J, Khorshid M, Zavolan M: A quantitative analysis of CLIP methods for identifying binding sites of RNAbinding proteins.
Nat Methods 2011, 8(7):559564. PubMed Abstract  Publisher Full Text

Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S, Paulsen IT, James K, Eisen JA, Rutherford K, Salzberg SL, Craig A, Kyes S, Chan MS, Nene V, Shallom SJ, Suh B, Peterson J, Angiuoli S, Pertea M, Allen J, Selengut J, Haft D, Mather MW, Vaidya AB, Martin DM, et al.: Genome sequence of the human malaria parasite Plasmodium falciparum.
Nature 2002, 419(6906):498511. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ewing B, Green P: Basecalling of automated sequencer traces using phred. II Error probabilities.
Genome Res 1998, 8:186194. PubMed Abstract  Publisher Full Text

Beckstette M, Homann R, Giegerich R, Kurtz S: Fast index based algorithms and software for matching position specific scoring matrices.
BMC Bioinformatics 2006, 7(1):389.
doi:10.1186/147121057389
PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text 
Li H: Aligning sequence reads, clone sequences and assembly contigs with BWAMEM. ArXiv eprints. 2013, 1303.3997

Huang W, Li L, Myers JR, Marth GT: ART: a nextgeneration sequencing read simulator.
Bioinformatics (Oxford, England) 2012, 28(4):593594.
doi:10.1093/bioinformatics/btr708
Publisher Full Text 
Li H: wgsim  Read simulator for next generation sequencing.
2011.

Holtgrewe M: Mason–a read simulator for second generation sequencing data. Technical Report FU Berlin. 2010

Willerslev E, Cooper A: Review paper. ancient dna.
Proc R Soc B: Biol Sci 2005, 272(1558):316. Publisher Full Text

Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, Berninger P, Rothballer A, Ascano Jr M, Jungkamp AC, Munschauer M, Ulrich A, Wardle GS, Dewell S, Zavolan M, Tuschl T: Transcriptomewide identification of rnabinding protein and microrna target sites by parclip.
Cell 2010, 141(1):129141. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kishore S, Jaskiewicz L, Burger L, Hausser J, Khorshid M, Zavolan M: A quantitative analysis of clip methods for identifying binding sites of rnabinding proteins.
Nat Methods 2011, 8(7):559564. PubMed Abstract  Publisher Full Text

Huang W, Li L, Myers J, Marth G: ART: a nextgeneration sequencing read simulator.
Bioinformatics 2012, 28:5934.
doi:10.1093/bioinformatics/btr708
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, Riley M, ColladoVides J, Glasner JD, Rode CK, Mayhew GF, Gregor J, Davis NW, Kirkpatrick HA, Goeden MA, Rose DJ, Mau B, Shao Y: The complete genome sequence of escherichia coli k12.
Science 1997, 277(5331):14531462. PubMed Abstract  Publisher Full Text

Lindgreen S: Adapterremoval: Easy cleaning of next generation sequencing reads.
BMC Res Notes 2012, 5(1):337. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Durbin R, Eddy SR, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge: Cambridge University Press; 1998.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/map format and SAMtools.
Bioinformatics 2009, 25:20782079. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

van Leeuwen J, Wood D: Interval heaps.
Comput J 1993, 36(3):209216.
doi:10.1093/comjnl/36.3.209
Publisher Full Text