Department Secondary Analysis, Pacific Biosciences, 1005 Hamilton Rd, Menlo Park, CA, USA

Department of Mathematics, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA, USA

Abstract

Background

Recent methods have been developed to perform high-throughput sequencing of DNA by Single Molecule Sequencing (SMS). While Next-Generation sequencing methods may produce reads up to several hundred bases long, SMS sequencing produces reads up to tens of kilobases long. Existing alignment methods are either too inefficient for high-throughput datasets, or not sensitive enough to align SMS reads, which have a higher error rate than Next-Generation sequencing.

Results

We describe the method BLASR (Basic Local Alignment with Successive Refinement) for mapping Single Molecule Sequencing (SMS) reads that are thousands of bases long, with divergence between the read and genome dominated by insertion and deletion error. The method is benchmarked using both simulated reads and reads from a bacterial sequencing project. We also present a combinatorial model of sequencing error that motivates why our approach is effective.

Conclusions

The results indicate that it is possible to map SMS reads with high accuracy and speed. Furthermore, the inferences made on the mapability of SMS reads using our combinatorial model of sequencing error are in agreement with the mapping accuracy demonstrated on simulated reads.

Background

The first step in a resequencing study is to map reads from a sample genome onto a reference, accounting for sample variance and sequencing error. An accurate and sensitive approach is to use Smith-Waterman

Sequencing methods based on single molecule sequencing (SMS) also produce large datasets that have high computational demands for mapping. SMS datasets do not have the length limitations of NGS or Sanger sequencing, but have a higher number of errors, and the errors are primarily insertions and deletions rather than substitutions. Thus, mapping methods created for NGS sequencing do not extend well to SMS reads. A recent study using the PacBio

Many alignment methods in similar application areas share related algorithmic approaches or data structures that are tailored to optimize the particular targeted application. The relationship between many existing alignment methods

**An illustration of relationships between alignment methods.** The applications / corresponding computational restrictions shown are (green) short pairwise alignment / detailed edit model; (yellow) database search / divergent homology detection; (red) whole genome alignment / alignment of long sequences with structural rearrangements; and (blue) short read mapping / rapid alignment of massive numbers of short sequences. Although solely illustrative, methods with more similar data structures or algorithmic approaches are on closer branches. The BLASR method combines data structures from short read alignment with optimization methods from whole genome alignment.

Advances in isolation and detection of single molecules and reactions have enabled SMS methods

We propose aligning SMS reads with high indel rates to genomes as follows. First, find clusters of short exact matches between the read and the genome using either a suffix array or BWT-FM index

We implemented our method in a program called BLASR (Basic Local Alignment with Successive Refinement), which combines the data structures used in short read mapping with alignment methods used in whole genome alignment. A BWT-FM index or suffix array of a genome is queried to generate short exact matches that are clustered and give approximate coordinates in the genome for where a read should align. A rough alignment is generated using sparse dynamic programming on a set of short exact matches in the read to the region it maps to, and a final detailed alignment is generated using dynamic programming within an area guided by the sparse dynamic programming alignment.
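The exact-match query step can be illustrated with a small sketch. It assumes a naive suffix array built by sorting suffixes, which is only practical for toy inputs and is not the index construction used by BLASR. The function names `build_suffix_array`, `locate`, and `find_anchors` are hypothetical.

```python
def build_suffix_array(genome):
    """Naive suffix array: start positions sorted by the suffix they begin."""
    return sorted(range(len(genome)), key=lambda i: genome[i:])


def locate(genome, sa, word):
    """Return all genome positions where `word` occurs, via binary search."""
    k = len(word)
    lo, hi = 0, len(sa)
    # First suffix whose length-k prefix is >= word.
    while lo < hi:
        mid = (lo + hi) // 2
        if genome[sa[mid]:sa[mid] + k] < word:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    # First suffix whose length-k prefix is > word.
    while lo < hi:
        mid = (lo + hi) // 2
        if genome[sa[mid]:sa[mid] + k] <= word:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


def find_anchors(read, genome, sa, k):
    """List (read_pos, genome_pos) pairs for every exact k-mer match."""
    anchors = []
    for r in range(len(read) - k + 1):
        for g in locate(genome, sa, read[r:r + k]):
            anchors.append((r, g))
    return anchors


genome = "ACGTACGTTT"
sa = build_suffix_array(genome)
print(find_anchors("CGTT", genome, sa, 3))
```

A fixed word length `k` is used here for simplicity; the matches produced this way are the raw material for the clustering step.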

Results and discussion

Our results are broken down into two sections; in the first, we examine characteristics of PacBio

Mapping feasibility

Our strategy to map SMS reads is to locate a relatively small number of candidate intervals where the read may map and then use detailed pairwise alignments to determine the best candidate. The candidate intervals may be found by locating all exact matches between the read and the genome, and then finding dense clusters of exact matches (anchors) in spans of similar length and the same (or reverse complement) order and orientation in both the genome and read, as described in detail in Methods. The feasibility of the method depends on the balance of having enough anchors to detect the correct interval to align a read to, vs. having so many anchors that clustering takes a prohibitive amount of time.
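As a rough illustration of finding a dense cluster of anchors, the sketch below slides a window over anchor positions in the genome and reports the window holding the most anchors. It is a simplification: it ignores anchor order and orientation, and the window width `(1 + delta) * read_len` with `delta = 0.15` is an assumption borrowed from the indel rate used elsewhere in the text. The name `best_cluster` is hypothetical.

```python
def best_cluster(anchors, read_len, delta=0.15):
    """Return (count, first_genome_pos, last_genome_pos) of the densest window.

    anchors: list of (read_pos, genome_pos) exact matches.
    The window spans at most (1 + delta) * read_len genome bases.
    """
    span = int((1 + delta) * read_len)
    positions = sorted(g for _, g in anchors)
    best = (0, None, None)
    lo = 0
    for hi in range(len(positions)):
        # Shrink the window from the left until it fits within the span.
        while positions[hi] - positions[lo] > span:
            lo += 1
        count = hi - lo + 1
        if count > best[0]:
            best = (count, positions[lo], positions[hi])
    return best


anchors = [(0, 100), (40, 150), (200, 300), (10, 5000), (5, 9000), (90, 9100)]
print(best_cluster(anchors, read_len=1000))
```

With too few anchors the correct window is indistinguishable from chance matches; with too many, the sort and scan dominate the run time, which is the balance described above.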

One approach to limiting the number of anchors is to limit to a set of anchors of low multiplicity in the genome; this is commonly done by using longer anchors. When the sequencing error rate is

**The distribution of lengths of error-free segments of reads.** The line fitted to the points weighted by frequency has slope −0.071, corresponding to a geometric distribution with parameter 0.848, in close agreement with the 84.5% accuracy of the dataset used. Over 95% of segments are of length 20 or less.
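The numbers in this caption can be checked with a short calculation. The sketch assumes the fit was made to log10 frequency against segment length, and that a segment of k correct bases has probability (1 − p)·p^k; both are assumptions, since the fitting procedure is not spelled out here.

```python
# Geometric parameter implied by a slope of -0.071 on a log10 scale.
slope = -0.071
p_from_slope = 10 ** slope
print(round(p_from_slope, 3))  # close to the reported 0.848


def fraction_at_most(p, n):
    """Fraction of error-free segments with at most n bases,
    when P(length = k) = (1 - p) * p**k for k = 0, 1, 2, ..."""
    return 1 - p ** (n + 1)


print(fraction_at_most(0.848, 20))  # exceeds 0.95
```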

We may model SMS sequencing as a process that generates a series of error-free words with a geometric length distribution, each separated by a single erroneous base. With this model, it is possible to determine how many words must be sequenced until there is a high probability that a word of length at least K is sequenced; the **waiting length** is the corresponding number of bases.

The waiting lengths for words of size 15, 20, and 25 are shown for **anchors**.

**Waiting length to sequence a word of length ≥ k.** The waiting lengths to sequence a word of length ≥ k with ε = 0.05, at varying accuracy. This gives an estimate of the number of bases required to sequence before having an error-free stretch that may serve as an alignment anchor.
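A minimal sketch of the waiting-length calculation under the geometric word model follows. It assumes that each word has probability p^K of containing at least K correct bases, that words are independent, and that ε is the allowed probability of not yet having seen such a word. The function names are hypothetical and the formula is a reconstruction from the model description, not necessarily the exact expression used to draw the figure.

```python
import math


def words_needed(p, k, epsilon=0.05):
    """Smallest n with 1 - (1 - p**k)**n >= 1 - epsilon."""
    return math.ceil(math.log(epsilon) / math.log(1 - p ** k))


def waiting_length(p, k, epsilon=0.05):
    """Approximate bases sequenced before an error-free word of length >= k.

    Each word averages p / (1 - p) correct bases plus one erroneous base,
    that is 1 / (1 - p) bases in total.
    """
    return words_needed(p, k, epsilon) / (1 - p)


for k in (15, 20, 25):
    print(k, words_needed(0.85, k), round(waiting_length(0.85, k)))
```

Longer anchors are less ambiguous in the genome but raise the waiting length sharply at lower accuracy.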

Other alignment methods such as Gapped BLAST
**NumConfigurations(M,N,K,L)** as the number of ways to distribute the positions of M errors when reading from the template such that there are at least N maximal substrings of length ≥ K not interrupted by error. In Appendix 1, we compute this using generating functions, allowing us to apply the result across the read lengths and error profiles found in SMS sequencing. Weese et al.

Assuming all permutations of errors are equally likely,

Values for

**Values for****for parameters similar to SMS sequencing.** The fraction of configurations allowing at least N anchors of length 15, 20, and 25 for N between 0 and 50 are shown for a 1000 base read when placing (**A**) 200, (**B**) 150, (**C**) 100, and (**D**) 50 errors.

When a read is sampled from a repeat in the genome, there are likely to be many dense clusters of anchors mapping the read across the genome. Assuming the repeat is divergent, it is necessary to perform a detailed alignment (Smith-Waterman) to all intervals containing dense clusters of anchors in order to distinguish the correct mapping location from other repeats. For copies of a repeat such as Alu or LINE in the human genome, the computational demands are too prohibitive to align the read against all instances of the repeat. On the other hand, if only a limited number of mapped locations are aligned in detail, the chance of finding the correct location is small. The similarity of repeats in a genome is typically defined by percent identity from a pairwise alignment of the two sequences
**anchor similarity** of two sequences is the maximum number of fixed-length, non-overlapping, ordered anchors shared between two sequences, with certain constraints on anchor spacing. If the anchor similarity is S, we also say the two sequences are **S-similar**.

**Supplementary Text S1.** The supplementary text contains additional implementation details for the anchor similarity method, and description of the empirical model based read simulator.

To characterize the repetitiveness by anchor similarity of sequences in the human genome, we took a sample of 1 million random intervals of length L=1 kb in the genome, and computed anchor similarity of each interval with all other intervals up to length (1 + δ)L = 1150 (assuming an indel rate δ = 0.15) in the rest of the genome. We used anchors of lengths 15, 20, and 25. For each interval and anchor length, a histogram is generated for the number of times ≥S-similar intervals are found in the genome. A hypothetical sample sequence with K = 15 may have 50 thousand ≥1-similar intervals in the genome; one thousand ≥2-similar intervals; one hundred ≥3-similar sequences; ten ≥4-similar sequences; and one ≥5-similar sequence. This results in one million histograms (for each anchor length). To summarize these, we examined the cumulative distribution of values of all histograms for ≥1, ≥5, ≥10, and ≥20-similar sequences, as shown in Figure

**-similar sequences measured in the human genome.** 1 million query intervals, each 1000 bases long, were randomly sampled from the genome. Each query interval was searched against the human genome to determine the number of non-overlapping 1000 base intervals in the genome that are ≥**A**) ≥1-similar, (**B**) ≥5-similar, (**C**) ≥10-similar, and (**D**) ≥20-similar to these 1 million query intervals, is shown. Each panel uses minimum anchor lengths

We compared the distribution of values of anchor similarity from the human genome with values of

To gauge the mapability of sequences to various genomes, we simulated reads from

The mapability of simulated sequences from the

**The mapability of simulated sequences from the****,****, and human genomes.** Mapping accuracy is shown on a Phred scale (

As shown in Figure

Mapping benchmarks

We generated three datasets for evaluating mapping speed and accuracy of different aligners on SMS reads (see Table

**Dataset**

**Description**

Pacific Biosciences-

50× coverage of reads simulated from

100 MB of reads simulated from the human genome.

The

**Supplementary Table S1.** Supplementary Table S1 gives the command line parameters used to run the benchmarks.

Statistics of reads from

**Statistics of reads from****O104:H4 produced by the PacBio****sequencing platform.** (

| **Method** | **Number of aligned reads** | **Number of aligned bases** | **Run time** |
|---|---|---|---|
| BLASR-SA | 94057 | 230.8 M | 20m 54s |
| BLASR-BWT | 94527 | 230.1 M | 33m 57s |
| BWA-SW | 97729 | 132.4 M | 434m 5s |
| BLAT | 99530 | 181.7 M | 4724m 40s |

Each method was used to align 48× coverage of reads from

To test the sensitivity and specificity of mapping, reads were simulated using an empirical model (described in Additional file

| **Method** | **Correctly mapped reads** | **Correctly mapped bases** | **Incorrectly mapped reads** | **Incorrectly mapped bases** | **Skipped reads** | **Runtime** | **Memory footprint** |
|---|---|---|---|---|---|---|---|
| **E. coli** | | | | | | | |
| BLASR-SA | 108789 | 266.5M | 229 | 0.38M | 3766 | 48m 18s | 202 MB |
| BLASR-BWT | 108795 | 265.3M | 259 | 0.45M | 3604 | 59m 39s | 46 MB |
| BWA-SW | 111192 | 261.9M | 1835 | 0.91M | 3005 | 223m 57s | 190 MB |
| **H. sapiens** | | | | | | | |
| BLASR-SA | 41726 | 102.3M | 1074 | 1.89M | 413 | 92m 26s | 14.7 GB |
| BLASR-BWT | 41582 | 101.7M | 1159 | 1.75M | 472 | 53m 26s | 8.1 GB |
| BWA-SW | 40381 | 96.3M | 292 | 1.16M | 1554 | 105m 24s | 4.2 GB |

Reads are simulated from

In addition to the information encoding the alignment, BLASR produces a mapping quality value for every alignment. This value represents the PHRED-scaled probability that the coordinates to which the read is aligned in the genome are incorrect, similar to the mapping quality values produced by Maq

**Mapping quality values of reads simulated from the human genome.** (**A**) The frequency of quality values for alignments of 10^{6} simulated 1000, 2000, and 3000 base sequences from the human genome. (**B**) The empirical mapping quality values of the alignments.

Conclusion

Methods to produce reads through single molecule sequencing were mostly theoretical a decade ago; such reads are now produced in high throughput on an industrial platform. The different characteristics of the sequences produced by SMS relative to Next Generation sequencing (sequences several orders of magnitude longer than previous technologies, at the expense of a higher error rate concentrated in insertions and deletions) require new computational techniques for efficient analysis. Here, we addressed the problem of mapping SMS reads to a reference genome by first examining the feasibility of mapping SMS reads, and then by benchmarking our new alignment method on reads produced by the PacBio

There are many emerging problems for processing SMS sequences. As the length of the reads produced by SMS increases, the computational problem resembles whole genome alignment more than the read mapping problem. This increases the need to have methods that accurately detect structural rearrangements covered by single reads. Furthermore, with the inevitable exponential increase in sequencing throughput, the current methods will not be sufficient to align SMS reads without a large amount of time or computational resources, and further algorithmic improvements will be necessary. We did not address the issue of using multiple sequence alignment to produce a consensus sequence or variant calls. It has been shown that the additional information one may gain by observing the signal from single-molecule events in real time may indicate DNA modifications such as methylation

Methods

We use a

**Overview of the BLASR method.** (**A**) Candidate intervals are found by mapping short, exact matches as shown by colored arrows. Either a suffix array or BWT-FM index of the genome is used to find the exact matches. Intervals are defined over clusters of matches and are ranked; intervals with score 3, 6, and 4 are shown. (**B**) Matches scoring above a threshold are aligned using sparse dynamic programming on shorter exact matches. (**C**) Alignments that have a high-scoring sparse-dynamic programming score are realigned by dynamic programming over a subset of cells defined using the sparse dynamic programming alignment as a guide.

Detecting candidate intervals

The input to the BLASR method is a read _{1},…,_{
R
}; a genome _{1},…,_{
G
}; and a minimum match length,

We use either a suffix array (SA) or BWT-FM index on the genome to query for exact matches, depending on time and space requirements. While some NGS alignment methods such as mrFAST and RazerS match using hash tables on fixed width words (_{
i
}) = _{
i,…,R
}
_{
i
}. We choose a parameter _{
i
} <

Descriptions of the implementation and methods for the Count and Locate queries using suffix arrays are given in

Once the set of anchors

The clusters are assigned a frequency weighted score that is the sum
_{
j
}) is the frequency of the sequence of _{
j
} in the genome, and are ranked by this score. Only the top MAXCANDIDATES clusters are retained (typically 10). The original indexing of clusters by anchor position is replaced by indexing by rank of the frequency-weighted score. The subscript notation is dropped and rank of a cluster is indicated by the superscript. The remaining clusters are denoted

While limiting the number of clusters retained may miss alignments to repetitive regions, filtering clusters on this frequency-weighted score was shown to be highly discriminative in our tests.

Refining alignments

Each cluster
^{FIRST}(^{LAST}) be the anchors with least (greatest) Genome(^{FIRST}) − (1 + ^{FIRST}), and ending position ^{LAST}) + (1 + ^{LAST}) + ^{LAST}))), of length

The read must be quickly aligned to a candidate interval, even if it is many tens of kilobases long. Similar to the method of anchoring the interval to the genome but on a smaller scale, a set of matches are found between the read and the candidate interval. The matches used in SDP are of a short fixed length, ^{SDP} (typically 8–11 bases). Let
^{SDP} that are exact matches between the read and the genome interval _{
s
},…,_{
f
}. Sparse dynamic programming finds the largest subset of anchors
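The chaining idea behind sparse dynamic programming can be sketched as follows. This is a quadratic-time illustration that selects the largest set of fixed-length anchors that are non-overlapping and increasing in both the read and the genome interval; a production implementation would use a faster data structure, and the exact scoring used by BLASR is not reproduced here. The name `longest_chain` is hypothetical.

```python
def longest_chain(anchors, k):
    """Largest subset of (read_pos, genome_pos) anchors of length k that are
    non-overlapping and increasing in both coordinates."""
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return []
    best = [1] * n   # best[j]: length of the best chain ending at anchor j
    prev = [-1] * n  # back-pointers for traceback
    for j in range(n):
        rj, gj = anchors[j]
        for i in range(j):
            ri, gi = anchors[i]
            if ri + k <= rj and gi + k <= gj and best[i] + 1 > best[j]:
                best[j] = best[i] + 1
                prev[j] = i
    # Trace back from the end of the longest chain.
    j = max(range(n), key=best.__getitem__)
    chain = []
    while j != -1:
        chain.append(anchors[j])
        j = prev[j]
    return chain[::-1]


anchors = [(0, 100), (10, 110), (20, 50), (30, 130), (40, 125)]
print(longest_chain(anchors, k=8))
```

The resulting chain gives a rough alignment path; the bases between chained anchors are left for the final banded alignment.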

The SDP alignment does not align all bases in a read, and so it is necessary to realign a final time using banded dynamic programming. For long reads with indels, the size of the band used to contain the entire alignment becomes prohibitively large. The set of anchors
^{SDP}centered about the diagonal where there are anchors, as well as a banded alignment of size ^{drift} between anchors where ^{drift} is the off-diagonal distance between adjacent anchors + ^{SDP}.

In addition to the base sequences produced by the PacBio_{
i,j
} in the dynamic programming matrix according to:

The MISMATCHPRIOR and DELETIONPRIOR are PHRED scaled penalties that reflect the global mismatch and deletion rates. In practice, MISMATCHPRIOR is 20 and DELETIONPRIOR is 15.

Mapping quality values

Due to the repetitive nature of genomes, a read often maps with a high alignment score to many locations. It is informative to calculate the probability that the interval a read is mapped to by an alignment is the correct location in the genome. This probability may be interpreted as a

A Bayesian probability technique was presented in

where _{
i,…,i + R−1}) is the probability of observing the read _{
i
} denote the probability that a base in a read is incorrect. Then Pr(_{
i,…,i + R − 1}. When there are insertions and deletions in the sequence, the value Pr(_{
i,…,i + R − 1}) may be computed as

The denominator of Equation 1 gives the marginal probability that the read is observed from anywhere in the genome. Evaluating this full sum is computationally infeasible even for short reads and ungapped alignments. Since the probability of observing a read given a template sequence drops geometrically with divergence, most positions in the genome do not contribute significantly to the sum. For short reads, the sum is approximated in

In BLASR, the mapping quality value is calculated in a similar manner. The sum in Equation 1 is limited to the top ^{2} for number of anchor bases, and count all clusters with more than
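The general form of such a calculation can be sketched as below, assuming the likelihood of the read at each retained candidate location is already available as a log10 value. The restriction of the sum to a few top candidates follows the text; the cap of 254 and the function name `mapping_quality` are assumptions made for illustration, not BLASR's actual parameters.

```python
import math


def mapping_quality(log10_likelihoods, best_index=0, cap=254.0):
    """PHRED-scaled probability that the chosen location is wrong.

    log10_likelihoods: log10 Pr(read | location) for each retained candidate.
    best_index: index of the location reported for the read.
    """
    m = max(log10_likelihoods)
    # Rescale by the maximum so the powers of ten do not underflow.
    weights = [10 ** (x - m) for x in log10_likelihoods]
    total = sum(weights)
    others = sum(w for i, w in enumerate(weights) if i != best_index)
    p_wrong = others / total
    if p_wrong <= 0:
        return cap
    return min(cap, -10 * math.log10(p_wrong))


print(mapping_quality([-10.0, -10.0]))  # two equally good locations
print(mapping_quality([-10.0, -13.0]))  # best location 1000 times more likely
```

Two equally likely locations give a quality near 3, while a unique strong hit approaches the cap.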

Appendix 1

Enumeration of configurations with specified numbers of errors and anchors

In this section, we will show how to explicitly compute NumConfigurations(

Consider a read of length _{1} < _{2} < … < _{
M
} ≤ _{0} = 0 and _{
M + 1} =

For the sake of simplicity, we assume all sequencing errors are of length 1, but this can be generalized to insertions and deletions that change the length of the read.

The error positions split the read into **parts** of sizes _{
i
} − _{
i−1} ≥ 1 for _{
i
} (_{
i
} − 1 matches followed by one mismatch. The last part consists of _{
M + 1} − 1 matches. Note that if there are two consecutive mismatches, there will be a part _{
i
} = 1 corresponding to 0 matches followed by one mismatch.

Part sizes _{
i
} are related to the notation _{
i
} = _{
i
}) to specify the word number. In this section, _{
i
} counts the correct bases and also counts one incorrect base at the end, based on our simplification that all sequencing errors are of length one.

Set _{1},_{2},…,_{
M + 1}). These are positive integers that add up to **strict composition** of

Consecutive errors greater than _{
i
} > **anchors** while consecutive errors shorter than this (**short matches**.

In Figure

**Toy example for counting components.** A read of length

For reads of length 7 with 2 errors, and minimum anchor length 3, the number of compositions with exactly one anchor (allowing it to be any of the parts, via permutations of these compositions) is 6·3 = 18.
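This count can be confirmed by brute force for small parameters. The sketch below enumerates every placement of the errors and counts the maximal error-free runs of length at least K; it assumes, as in the text, that each error occupies exactly one base. The function names are hypothetical.

```python
from itertools import combinations


def count_anchors(read_len, error_positions, k):
    """Number of maximal error-free runs of length >= k in a read with
    1-based error positions."""
    runs, prev = 0, 0
    for pos in list(error_positions) + [read_len + 1]:
        if pos - prev - 1 >= k:
            runs += 1
        prev = pos
    return runs


def configurations_with_exactly(read_len, m, n, k):
    """Count placements of m errors that leave exactly n anchors."""
    return sum(
        1
        for errors in combinations(range(1, read_len + 1), m)
        if count_anchors(read_len, errors, k) == n
    )


print(configurations_with_exactly(7, 2, 1, 3))  # 18
```

Enumeration grows combinatorially with read length, which is why a generating-function approach is needed for realistic read lengths.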

For arbitrary values of the parameters, we first compute the number of configurations where all

Let **integer compositions** of _{
M,N,K
}(_{
i
} > _{1},…,_{
N
} > _{
N + 1},…,_{
M + 1} ≤

Note that

The compositions of

• _{1},…,_{
N
} ∈ 1,2,…,

• _{
N + 1},_{
N + 2},…,_{
M
} ∈

• _{1} + ⋯ + _{
M
} =

The generating functions for short parts,

Standard methods for enumerating compositions with generating functions give that

where we expand the left side in a MacLaurin series (Taylor series centered at

To compute _{
M,N,K
}(^{
L + 1}in (A5). We present two methods to do this.

**First Taylor series method:** The coefficient of ^{
L + 1} in the result is _{
M,N,K
}(

**Second Taylor series method:** We present an exact closed-form solution. Mathematically, closed-form solutions are usually preferred. However, the first method above may be preferable for computation because intermediate steps of this second method require much higher precision, as discussed in Appendix 2.
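A series-expansion computation using only sums and products of non-negative integers can be sketched with truncated polynomial multiplication. The generating functions used here are a reconstruction from the part-size description and the toy example, not a transcription of the paper's equations: a short part contributes x + x^2 + … + x^K, an anchor part contributes x^(K+1) + x^(K+2) + …, and the coefficient of x^(L+1) is extracted. All function names are hypothetical.

```python
from math import comb


def poly_mul(a, b, max_deg):
    """Multiply two coefficient lists, discarding terms above max_deg."""
    out = [0] * (max_deg + 1)
    for i, ai in enumerate(a):
        if ai == 0:
            continue
        for j, bj in enumerate(b):
            if i + j > max_deg:
                break
            out[i + j] += ai * bj
    return out


def exactly_n_anchors(L, M, N, K):
    """Placements of M errors in a read of length L with exactly N anchors
    (error-free runs of length >= K)."""
    if N > M + 1:
        return 0
    D = L + 1
    short = [1 if 1 <= d <= K else 0 for d in range(D + 1)]
    anchor = [1 if d >= K + 1 else 0 for d in range(D + 1)]
    prod = [1] + [0] * D
    for _ in range(N):
        prod = poly_mul(prod, anchor, D)
    for _ in range(M + 1 - N):
        prod = poly_mul(prod, short, D)
    # Choose which of the M + 1 parts are the anchors.
    return comb(M + 1, N) * prod[D]


def fraction_at_least(L, M, N, K):
    """Fraction of error placements with at least N anchors, assuming all
    placements are equally likely."""
    hits = sum(exactly_n_anchors(L, M, n, K) for n in range(N, M + 2))
    return hits / comb(L, M)


print(exactly_n_anchors(7, 2, 1, 3))  # 18, matching the toy example
print(fraction_at_least(7, 2, 1, 3))
```

Python integers are arbitrary precision, so no rounding occurs in the counts; the quadratic polynomial product makes this sketch slow for reads of a thousand bases, where a faster convolution would be preferable.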

Theorem A1

_{
M,N,K
}(

_{max} = min(⌈

_{max} < 0 _{
M,N,K
}(

Proof

For

For

The binomial theorem and the negative binomial series give

Plugging these into (A7), we obtain

In (A5), the coefficient of ^{
L + 1} is _{
M,N,K
}(

Appendix 2

Numerical precision of the closed form solution for the number of anchors

Theorem A1 (also called the “Second Taylor series method”) gives a closed form expression (A6) to compute _{
M,N,K
}(

For ^{93} and 2^{401}, while the value of the sum is much smaller, with magnitude 2^{294}. Using high precision floating point, we need at least 110 bits for the mantissa to get the first decimal digit correct. This is significantly more bits than is currently standard: the current standard for floating point, IEEE 754, provides for a 53 bit mantissa in double precision. Alternatively, using high precision integers, we would need 294 bits of integer precision, plus a sign bit. However, software for arbitrary precision integers, such as Maple or Mathematica, will handle this example correctly.

By contrast, the “First Taylor series method” only involves sums and products of positive integers, each bounded above by the value of _{
M,N,K
}(_{
M,N,K
}(

Appendix 3

Statistics of number of anchors

We may estimate the number of anchors using the following theorem.

Theorem A2

Fix

For fixed

For fixed _{
M,N,K
}(

Note that

and thus the probability of exactly

Next, for fixed

Using standard generating function properties, the numerator
^{
L + 1} in the following expression:

First we evaluate the derivative; second, we plug in ^{
L + 1}; and fourth, we use this to compute

1. The derivative in Eq. (A14) is

2. Plug in

3. Expand the Taylor series and extract the coefficient of ^{
L + 1}:

The term ^{
L + 1}occurs when

4. Evaluate

Note that if

Next we compute the variance of N, using a similar generating function technique. The generating function will enable us to compute

which is equivalent to the more common formula σ^{2} = E[N^{2}] − μ^{2}. We have:

The numerator
^{
L + 1}in the following expression:

We evaluate this in a fashion similar to

1. The derivative in Eq. (A15) is

2. Plug in

3. Expand the Taylor series and extract the coefficient of ^{
L + 1}:

The term ^{
L + 1} occurs when

4. Evaluate

5. Evaluate ^{2} = Var[

Appendix 4

Asymptotic number of anchors

Theorem A3

Let ^{2} be given by Theorem A2. For sufficiently large

where

Proof

For fixed _{
M,K
}(_{
M,N,K
}(^{
L + 1} for fixed ^{2}of this distribution in Theorem A2. By Eq. (A13), the total of the coefficients in _{
M,N,K
}(

Thus, we obtain Eqs. (A16) and (A17) as approximations for the coefficients _{
M,N,K
}(

In Figure
_{
M,N,K
}(

The fraction of configurations with exactly and at least

**The fraction of configurations with exactly and at least****anchors.** (**A**) Plot of the fraction of configurations with exactly _{M,N,K}(^{2}are computed by Theorem A2. (**B**) The solid markers are a plot of

Competing interests

MJC is a full-time employee at Pacific Biosciences, a company commercializing single-molecule, real-time nucleic acid sequencing technologies. GT was partially supported by a grant from the National Institutes of Health, USA (NIH grant 3P41RR024851-02S1). NIH had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Authors’ contributions

MJC proposed and implemented the mapping method, performed the analysis, and wrote the manuscript. GT solved and implemented the combinatorial analysis, and wrote the manuscript. Both authors read and approved the final manuscript.

Acknowledgements

We thank Jon Sorenson, James Bullard, Eric Schadt, and Jonas Korlach for useful comments in writing this manuscript.