Department of Electrical and Computer Engineering, University of Texas at Austin, 1 University Station C0803, TX, 78712-0240, US

Abstract

Background

Next-generation sequencing systems are capable of rapid and cost-effective DNA sequencing, thus enabling routine sequencing tasks and taking us one step closer to personalized medicine. Accuracy and lengths of their reads, however, are yet to surpass those provided by the conventional Sanger sequencing method. This motivates the search for computationally efficient algorithms capable of reliable and accurate detection of the order of nucleotides in short DNA fragments from the acquired data.

Results

In this paper, we consider Illumina’s sequencing-by-synthesis platform which relies on reversible terminator chemistry and describe the acquired signal by reformulating its mathematical model as a Hidden Markov Model. Relying on this model and sequential Monte Carlo methods, we develop a parameter estimation and base calling scheme called ParticleCall. ParticleCall is tested on a data set obtained by sequencing phiX174 bacteriophage using Illumina’s Genome Analyzer II. The results show that the developed base calling scheme is significantly more computationally efficient than the best performing unsupervised method currently available, while achieving the same accuracy.

Conclusions

The proposed ParticleCall provides more accurate calls than Illumina’s base-calling algorithm, Bustard. At the same time, ParticleCall is significantly more computationally efficient than other recent schemes with similar performance, making it more practical for high-throughput sequencing data analysis. Improved base-calling accuracy will have immediate beneficial effects on the performance of downstream applications such as SNP and genotype calling.

ParticleCall is freely available at

Background

The advancements of next-generation sequencing technologies have enabled inexpensive and rapid generation of vast amounts of sequencing data.

A widely used sequencing-by-synthesis platform, commercialized by Illumina, relies on reversible terminator chemistry. Illumina’s sequencing platforms are supported by a commercial base-calling algorithm called Bustard. While Bustard is computationally very efficient, its base-calling error rates can be significantly improved by various computationally more demanding schemes.

In this paper, we propose a Hidden Markov Model (HMM) representation of the signal acquired by Illumina’s sequencing-by-synthesis platforms and develop a particle filtering (i.e., sequential Monte Carlo) base-calling scheme that we refer to as ParticleCall. When relying on BayesCall’s Markov Chain Monte Carlo implementation of the EM algorithm (MCEM) to estimate system parameters, ParticleCall achieves the same error rate as BayesCall while reducing the time needed for base calling by nearly a factor of 3. To improve the speed of parameter estimation, we develop a particle filter implementation of the EM algorithm (PFEM). In our tests, PFEM reduces the parameter estimation time from 1139 to 39 minutes while leading to a very minor deterioration of the base-calling accuracy (error rate 0.0125 vs. 0.0124). Finally, we demonstrate that ParticleCall has the best discrimination ability among all of the considered base-calling schemes.

Methods

In this section, we first review the data acquisition process and the basic mathematical model of Illumina’s sequencing-by-synthesis platform. Then we introduce a Hidden Markov Model (HMM) representation of the acquired signals. Relying on the HMM and particle filtering (i.e., sequential Monte Carlo) techniques, we develop a novel base-calling and parameter estimation scheme and discuss some important practical aspects of the proposed method.

Illumina sequencing platform

A sequencing task on the Illumina’s platform is preceded by the preparation of a library of single-stranded short templates created by performing random fragmentation of the target DNA sample. Each single-stranded fragment in the library is placed on a glass surface (i.e., the flow cell

Quality of the acquired raw signals is adversely affected by the imperfections in the underlying sequencing-by-synthesis and signal acquisition processes. The imperfections are manifested as various sources of uncertainties. For instance, a small fraction of the strands being synthesized may fail to incorporate a base, or they may incorporate multiple bases in a single test cycle. These effects are referred to as phasing and pre-phasing, respectively, and they result in an incoherent addition of the signals generated by the synthesis of the complementary strands on the copies of the template. Other sources of uncertainty are due to cross-talk and delay effects in the optical detection process, the residual effects that are readily observed between subsequent test cycles, signal decay, and measurement noise.

Overview of the mathematical model

To describe the signal acquired by the Illumina’s sequencing-by-synthesis platform, a parametric model was proposed in

A length-$L$ DNA template is represented by a $4 \times L$ matrix $S$, where the $i$th column of $S$, $s_i$, is considered to be a randomly generated unit vector with a single non-zero entry indicating the type of the $i$th base in the sequence. We follow the convention where the first component of the vector $s_i$ corresponds to the base A, the second to C, the third to G, and the fourth to T, and denote the corresponding unit vectors as $e_A, e_C, e_G, e_T$. The goal of base-calling is to infer the unknown matrix $S$.

Let

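Let $H$ denote the $L \times L$ matrix constructed from $P_{ij}$ defined above, $1 \le i, j \le L$, whose $(i,j)$ entry ($H_{i,j}$) is the probability that a synthesized strand has length $i$ after $j$ cycles; here $P^j$ denotes the $j$th power of the matrix $P$. The signal decay is modeled through the per-cluster density $\lambda_t$,

where $d_t$ is the per-cluster density decay parameter within $[0,1]$. We denote the $t$th column of $H$ by $h_t$ and the $t$th column of the generated-signal matrix by $x_t$. Incorporating the decay into the model, the signal generated in cycle $t$ is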

where

where $K_t$ denotes the $4 \times 4$ crosstalk matrix describing overlap of the emission spectra of the four fluorescent tags, and the measurement noise is characterized by the covariance matrix $\Sigma_t$.

Note that, due to the typically small phasing and pre-phasing probabilities, the components of $h_t$ around its $t$th entry are significantly greater than the remaining ones. This observation can be used to simplify the expressions (2) and (3). In particular, $h_t$ can be truncated to a window around its $t$th entry, i.e., by setting the small components of $h_t$ to 0. In general, we consider a window of $h_t$ centered at position $t$ that retains the entries $H_{t-l,t}, H_{t-l+1,t}, \ldots, H_{t,t}, \ldots, H_{t+r-1,t}, H_{t+r,t}$, and then expression (2) becomes

Finally, note that the signal measured in cycle $t$ also contains a residual contribution from the previous cycle, which is modeled by adding $\alpha_t (1 - d_t)\, y_{t-1}$ to $y_t$, where the unknown parameter $\alpha_t \in (0,1)$. Therefore, the model can be summarized as

where $\|\cdot\|_2$ denotes the $\ell_2$-norm of its argument, and where $y_0 = \mathbf{0}$, $\lambda_0 = 1$.
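As a rough illustration of the model summarized above, the following Python sketch simulates the intensities of a single cluster. It is only a sketch under stated assumptions: the exact forms used here for the phasing matrix, the decay recursion, and the noise scaling are filled in for illustration and are not taken from the displayed equations of this paper, and all function names and parameter values (`phasing_matrix`, `simulate_cluster`, the crosstalk matrix) are hypothetical.

```python
import numpy as np

def phasing_matrix(L, p, q):
    """Return an L x L matrix H; H[i-1, j-1] is the (assumed) probability
    that a strand has length i after j cycles."""
    P = np.zeros((L + 1, L + 1))
    for i in range(L + 1):
        P[i, i] = p                        # phasing: strand fails to extend
        if i + 1 <= L:
            P[i, i + 1] = 1.0 - p - q      # normal single-base extension
        if i + 2 <= L:
            P[i, i + 2] = q                # pre-phasing: two bases in one cycle
        P[i, L] += 1.0 - P[i].sum()        # absorb leftover mass at the template end
    H = np.zeros((L, L))
    Pj = np.eye(L + 1)
    for j in range(1, L + 1):
        Pj = Pj @ P                        # j-th power of P
        H[:, j - 1] = Pj[0, 1:]            # start at length 0, drop the length-0 entry
    return H

def simulate_cluster(bases, p=0.01, q=0.01, d=0.01, sigma=0.05,
                     alpha=0.1, noise_sd=0.02, seed=0):
    """Simulate a 4 x L intensity matrix for one cluster (illustrative values)."""
    rng = np.random.default_rng(seed)
    L = len(bases)
    S = np.zeros((4, L))                   # columns are unit vectors e_A, e_C, e_G, e_T
    S[["ACGT".index(b) for b in bases], np.arange(L)] = 1.0
    H = phasing_matrix(L, p, q)
    K = 0.9 * np.eye(4) + 0.1 * np.ones((4, 4))   # hypothetical crosstalk matrix
    lam, y_prev = 1.0, np.zeros(4)         # lambda_0 = 1, y_0 = 0
    Y = np.zeros((4, L))
    for t in range(L):
        # Assumed decay recursion: multiplicative decay with small random fluctuation.
        lam = max((1 - d) * lam * (1 + sigma * rng.standard_normal()), 0.0)
        x = lam * (S @ H[:, t])            # phased signal generated in this cycle
        eta = noise_sd * rng.standard_normal(4)
        # Crosstalk + residual from previous cycle + noise scaled by ||x||_2.
        y = K @ x + alpha * (1 - d) * y_prev + np.linalg.norm(x) * eta
        Y[:, t] = y
        y_prev = y
    return Y
```

Calling `simulate_cluster("ACGTACGT")` returns a 4 × 8 array in which, for small phasing and noise, the channel of the true base dominates each column.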

Hidden Markov Model of DNA base-calling

In this section, we reformulate the statistical description of the signal acquired by Illumina’s sequencing-by-synthesis platform as a Hidden Markov Model (HMM). The template sequence is hidden and must be inferred from the observations $y_{1:L}$, motivating the HMM representation. HMMs provide a convenient framework for state and parameter estimation, which we exploit to develop a particle filter base-calling scheme in the next section.

For the sake of convenience, we remove the dependency between subsequent observations $y_{t-1}$ and $y_t$ by defining

Components of _{x1:L}. Moreover, let

Since _{λt}and _{λt} and

The proposed HMM representation is illustrated in Figure


**A hidden Markov model of the generated signal in Illumina sequencing-by-synthesis platforms.** An illustration of the graphical HMM of Illumina’s sequencing platform. The observations $y''$ represent signal intensities after the removal of residual effects. The states combine $\lambda_t$ with a subsequence of the template centered at position $t$.

On the other hand, the state transition dynamics is described by the transition probability between subsequent states. Since the template subsequence and $\lambda_t$ are independent, the transition probability is

The second term on the right-hand side of (8), $f_2(\lambda_t \mid \lambda_{t-1})$, is known from the density decay model (1),

For notational convenience, we use

Let $\mathcal{U}(\{e_A, e_C, e_G, e_T\})$ denote a uniform distribution on the support set of unit vectors $\{e_A, e_C, e_G, e_T\}$. We assume no correlation between consecutive bases of the template sequence, i.e., each base is drawn independently from $\mathcal{U}(\{e_A, e_C, e_G, e_T\})$. Therefore,

where the newly introduced base is distributed according to $\mathcal{U}(\{e_A, e_C, e_G, e_T\})$. Hereby, all the components of the HMM are specified.

ParticleCall base-calling algorithm

The goal of base calling is to determine the order of nucleotides in a template from the acquired signal $y_{1:t}$. This can be rephrased as the problem of inferring the most likely sequence of hidden states, from which $s_{1:L}$ follows directly. We assume that the model parameters, which include $\{d_{1:L}, \alpha_{1:L}, \sigma_{1:L}, K_{1:L}, \Sigma_{1:L}\}$, are common for all clusters within a tile, and that they are provided by a parameter estimation step discussed in the following section. In this section, we introduce a novel base-calling algorithm, ParticleCall, which relies on particle filtering techniques to sequentially infer the hidden states.

In general, particle filtering (i.e., sequential Monte Carlo) methods generate a set of particles with associated weights to estimate the posterior distribution of unknown variables given the acquired measurements. Here, the particles approximate the posterior distribution of $\mathbf{s}_t$ given the measurements, and the estimate of $s_t$ is obtained by solving

Our algorithm relies on a sequential importance sampling/resampling (SISR) particle filter scheme. We monitor the effective number of particles $K_{\mathrm{eff}}$ to decide when to resample and, for the sake of simplicity, employ the multinomial resampling strategy. If we denote the number of particles by $N_p$ and consider their associated weights, resampling is performed whenever $K_{\mathrm{eff}}$ is below a fixed threshold $N_{\mathrm{threshold}}$. A threshold $N_{\mathrm{threshold}}$ of size $O(N_p)$ is typically sufficient; we set $N_{\mathrm{threshold}} = N_p/2$.
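The resampling rule described above can be sketched in a few lines of Python. This is a minimal sketch, assuming the commonly used effective-sample-size estimate $1 / \sum_i w_i^2$ for normalized weights $w_i$ (the exact expression for $K_{\mathrm{eff}}$ is not reproduced in this text); the function names are hypothetical.

```python
import numpy as np

def effective_sample_size(weights):
    """Standard estimate 1 / sum(w_i^2) computed on normalized weights (assumed form)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def multinomial_resample(particles, weights, rng):
    """Draw len(particles) indices with probability proportional to the weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    n = len(w)
    idx = rng.choice(n, size=n, p=w)
    # After resampling, every surviving particle carries equal weight.
    return [particles[i] for i in idx], np.full(n, 1.0 / n)

def maybe_resample(particles, weights, rng, threshold=None):
    """Resample only when the effective sample size drops to the threshold or below."""
    n = len(weights)
    if threshold is None:
        threshold = n / 2            # mirrors the choice N_threshold = N_p / 2
    if effective_sample_size(weights) <= threshold:
        return multinomial_resample(particles, weights, rng)
    w = np.asarray(weights, dtype=float)
    return list(particles), w / w.sum()
```

With uniform weights the effective sample size equals the number of particles and nothing is resampled; when a single particle holds all the weight it equals 1 and the whole set is redrawn.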

We omit further details for brevity and formalize the ParticleCall algorithm below.

Algorithm 1

ParticleCall base-calling algorithm

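1. Initialization:

1.1 Initialize particles: **for** $i = 1, \ldots, N_p$ **do**

Sample each column of the template submatrix from $\mathcal{U}(\{e_A, e_C, e_G, e_T\})$; sample the initial $\lambda$. **end for**

1.2 Compute and normalize weights for each particle according to

2. Run iteration $t$:

2.1 Sampling: **for** $i = 1, \ldots, N_p$ **do**

Sample the new state of particle $i$. **end for**

2.2 Update the importance weight

2.3 Normalize the weights. Calculate the posterior probability of $s_t$ and obtain the estimate

2.4 Resampling: **if** $K_{\mathrm{eff}} \le N_{\mathrm{threshold}}$ **then**

Draw $N_p$ samples from the current particle set according to the normalized weights. **end if**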
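To make the structure of the iteration concrete, the following Python sketch runs a sequential importance sampling/resampling filter on a deliberately simplified toy model. It is not the ParticleCall implementation: the toy ignores phasing, crosstalk, and residual effects, proposes from the transition prior (a bootstrap proposal, assumed here), and uses an isotropic Gaussian likelihood. The function name `sisr_base_call` and all parameter values are hypothetical.

```python
import numpy as np

def sisr_base_call(Y, n_particles=200, d=0.02, sigma=0.05, noise=0.1, seed=0):
    """Toy SISR base caller. Y is a 4 x L intensity matrix; returns base indices 0..3."""
    rng = np.random.default_rng(seed)
    L = Y.shape[1]
    cols = np.arange(n_particles)
    lam = np.ones(n_particles)                         # lambda_0 = 1 for every particle
    log_w = np.full(n_particles, -np.log(n_particles))
    calls = []
    for t in range(L):
        # 2.1 Sampling: bases are uniform and independent; lambda decays with jitter.
        s = rng.integers(0, 4, size=n_particles)
        lam = (1 - d) * lam * (1 + sigma * rng.standard_normal(n_particles))
        # 2.2 Weight update with a Gaussian likelihood of the observed intensities.
        pred = np.zeros((4, n_particles))
        pred[s, cols] = lam
        log_w = log_w - 0.5 * np.sum((Y[:, [t]] - pred) ** 2, axis=0) / noise ** 2
        # 2.3 Normalize in the log domain, then call the base with largest posterior mass.
        m = log_w.max()
        log_w = log_w - (m + np.log(np.sum(np.exp(log_w - m))))
        w = np.exp(log_w)
        post = np.bincount(s, weights=w, minlength=4)
        calls.append(int(post.argmax()))
        # 2.4 Multinomial resampling when the effective sample size is too small.
        if 1.0 / np.sum(w ** 2) <= n_particles / 2:
            idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())
            lam = lam[idx]
            log_w = np.full(n_particles, -np.log(n_particles))
    return calls
```

Because the toy bases are independent across cycles, only the continuous component needs to be carried through resampling; in a windowed model the sampled template window would be resampled together with it.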

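Since the template subsequence and $\lambda_t$ are independent according to (8), it is possible to Rao-Blackwellize the ParticleCall algorithm. Rao-Blackwellization is used to marginalize part of the states in the particle filter, hence reducing the number of needed particles $N_p$. In particular, the distribution of the discrete template states is computed analytically given $\lambda_t$, while relying on the particle filter to calculate the posterior distribution of $\lambda_{1:t}$.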

The original posterior distribution of the states can be expressed as

Since _{λ1:t−1}|

Algorithm 2

Rao-Blackwellized ParticleCall algorithm

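1. Initialization:

1.1 Initialize particles: **for** $i = 1, \ldots, N_p$ **do**

Sample the initial $\lambda$. **end for**

1.2 Compute and normalize weights for each particle according to

1.3 Calculate the discrete distribution

2. Run iteration $t$:

2.1 Sampling: **for** $i = 1, \ldots, N_p$ **do**

Sample $\lambda_t$ for particle $i$. **end for**

2.2 Update the importance weight

2.3 Resample if $K_{\mathrm{eff}} \le N_{\mathrm{threshold}}$

2.4 Update the discrete distribution: **for** $i = 1, \ldots, N_p$ **do**

Update the discrete distribution of particle $i$. **end for**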

In step 2.2 of Algorithm 2, the quantity

where

In step 2.4 of Algorithm 2, the update equation is obtained as

Parameter estimation

To determine the set of parameters

Assumptions on parameters

Recall that the set of parameters needed to run ParticleCall includes $\{d_{1:L}, \alpha_{1:L}, \sigma_{1:L}, K_{1:L}, \Sigma_{1:L}\}$. We make two simplifying assumptions. The first concerns the phasing and prephasing parameters; the second is that $d_{1:L}$ and $\sigma_{1:L}$ are uniformly distributed over an interval, so that they can be incorporated into the hidden states of the HMM. Therefore, only the mean and variance of these parameters, i.e., $d_{\mathrm{mean}}$, $d_{\mathrm{var}}$, $\sigma_{\mathrm{mean}}$, and $\sigma_{\mathrm{var}}$, need to be estimated. Computational results demonstrate that these two assumptions do not affect the accuracy of base-calling.

Particle filter EM algorithm

In the early sequencing cycles, effects of phasing and prephasing are relatively small. Therefore, we may ignore phasing and prephasing to facilitate straightforward computation of the initial estimates of the remaining parameters. In particular, the signal generated in the early cycles

Replacing (2) by (10) leads to a simplified model that allows for straightforward base calling and inference of the parameters by means of linear regression. We use these values to obtain the estimates of $d_{\mathrm{mean}}$, $d_{\mathrm{var}}$, $\sigma_{\mathrm{mean}}$, and $\sigma_{\mathrm{var}}$, and to initialize the remaining parameters

The parameter estimation is performed window-by-window and is conducted using

where

where the expectation is taken with respect to

Results and discussion

The proposed method is evaluated on a data set obtained by sequencing phiX174 bacteriophage using Illumina Genome Analyzer II with a read length of 76 cycles. This is a short genome with a known sequence, which enables reliable performance comparison of different base-calling techniques. We tested ParticleCall and several other algorithms on a tile containing 77337 reads, and present the results here. All code is written in C, and the tests are run on a desktop with a 3 GHz quad-core Intel Core i7 processor.


**Per-cycle error rates of ParticleCall, BayesCall, naiveBayesCall, Rolexa and Bustard.** The figure compares the per-cycle error rates of different base-calling algorithms. ParticleCall and BayesCall are the most accurate ones.

Performance of ParticleCall

The base-calling error rates are computed by aligning the reads to the reference genome and evaluating the frequency of mismatches. Reads that could not be aligned to the reference with at least 70% matches are discarded. Note that the error rates and speed of the proposed ParticleCall algorithm and the parameter estimation scheme are affected by the number of particles $N_p$ and the parameter estimation window length. The performance as a function of $N_p$ is shown in the table below; $N_p = 800$ leads to high performance with reasonable speed. Rao-Blackwellized ParticleCall can achieve the same accuracy with fewer particles (in particular, $N_p = 300$); however, its running time is about 3 times that of the original ParticleCall at the same accuracy (287 vs. 88 minutes). This is because the Rao-Blackwellization steps in (9) require evaluating a sum over all possible states of the template window ($4^3 = 64$ for our choice of a three-base window).

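ParticleCall is run using parameters obtained via the MCEM parameter estimation scheme as well as via the PFEM parameter estimation algorithm proposed in this paper. Rao-Blackwellized ParticleCall is run using parameters via the MCEM parameter estimation scheme.

| Method | $N_p$ | Error rate | Base-calling time (min) |
|---|---|---|---|
| ParticleCall (via MCEM) | 400 | 0.0126 | 46 |
| ParticleCall (via MCEM) | 800 | 0.0124 | 88 |
| ParticleCall (via MCEM) | 1200 | 0.0124 | 130 |
| ParticleCall (via PFEM) | 400 | 0.0128 | 46 |
| ParticleCall (via PFEM) | 800 | 0.0125 | 91 |
| ParticleCall (via PFEM) | 1200 | 0.0125 | 133 |
| Rao-Blackwellized ParticleCall (via MCEM) | 100 | 0.0128 | 103 |
| Rao-Blackwellized ParticleCall (via MCEM) | 200 | 0.0125 | 190 |
| Rao-Blackwellized ParticleCall (via MCEM) | 300 | 0.0124 | 287 |
| Rao-Blackwellized ParticleCall (via MCEM) | 400 | 0.0124 | 386 |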

ParticleCall base-calling error rate and the parameter estimation time of the proposed PFEM parameter estimation algorithm.

| Window length | Base-calling error rate | Parameter estimation time (min) |
|---|---|---|
| 4 | 0.0125 | 50 |
| 5 | 0.0125 | 39 |
| 6 | 0.0127 | 29 |
| 7 | 0.0130 | 25 |

Performance comparison of different algorithms

The error rates and speed of the proposed ParticleCall algorithm are compared with those of BayesCall, naiveBayesCall, Rolexa, and Bustard. We run ParticleCall both with parameters provided by the computationally intensive MCEM algorithm as well as with those inferred by the PFEM parameter estimation scheme proposed in this paper. The results are reported in Table

The base-calling error rate and the running times of different algorithms. ParticleCall is run using parameters obtained via the MCEM parameter estimation scheme as well as via the PFEM parameter estimation algorithm proposed in this paper. For Bustard and Rolexa, only the total running times are reported.

| Method | Base-calling error rate | Base-calling time (min) | Parameter estimation time (min) |
|---|---|---|---|
| Bustard | 0.0152 | 2 (total) | |
| Rolexa | 0.0170 | 35 (total) | |
| naiveBayesCall | 0.0132 | 21 | 1139 |
| BayesCall | 0.0124 | 231 | 1139 |
| ParticleCall (via MCEM) | 0.0124 | 88 | 1139 |
| ParticleCall (via PFEM) | 0.0125 | 91 | 39 |
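The running times reported above can be combined into total times and speedup ratios with a few lines of arithmetic. The short Python sketch below only restates numbers from the comparison table; the variable names are illustrative.

```python
# (base-calling time, parameter estimation time) in minutes, copied from the table
times = {
    "naiveBayesCall": (21, 1139),
    "BayesCall": (231, 1139),
    "ParticleCall (via MCEM)": (88, 1139),
    "ParticleCall (via PFEM)": (91, 39),
}

# Total time = parameter estimation + base calling
totals = {name: sum(pair) for name, pair in times.items()}

# Base-calling speedup of ParticleCall over BayesCall with the same (MCEM) parameters
base_speedup = times["BayesCall"][0] / times["ParticleCall (via MCEM)"][0]

# End-to-end speedup when PFEM replaces MCEM for parameter estimation
total_speedup = totals["BayesCall"] / totals["ParticleCall (via PFEM)"]

print(f"base-calling speedup: {base_speedup:.2f}x")   # 231 / 88
print(f"total speedup: {total_speedup:.1f}x")         # 1370 / 130
```

On these numbers, base calling alone is about 2.6 times faster than BayesCall, while the total time including parameter estimation drops from 1370 to 130 minutes.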

Quality scores

Quality scores are used to characterize confidence in the outcome of the base-calling procedures. They are computed as part of the analysis of the acquired raw data and may be used to filter out reads of suspect quality, or to shorten the reads if the quality scores of individual bases fall below certain thresholds. They can also provide confidence information for downstream analysis including sequence assembly and SNP and genotype calling. Frequently used are the so-called

Essentially,

Quality scores can be used to compare the discrimination ability of different algorithms. The discrimination score

**Discrimination ability of quality scores vs. error tolerance.** The figure shows the percentage of correctly called bases under different error tolerance
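For readers unfamiliar with these quantities, the Python sketch below shows one way they can be computed. The phred-style mapping $Q = -10 \log_{10} P_{\mathrm{error}}$ is the standard definition; the discrimination function is an assumption made for illustration (the fraction of all bases that are correctly called within the largest set of top-quality bases whose empirical error rate stays within the tolerance) and may differ in detail from the definition used for the figure. Function names are hypothetical, and ties between equal quality scores are broken arbitrarily.

```python
import math

def phred(p_error):
    """Phred-style quality score for an estimated error probability."""
    return -10.0 * math.log10(p_error)

def discrimination_ability(qualities, is_correct, eps):
    """Assumed definition: sort bases by decreasing quality, find the longest
    prefix whose empirical error rate is at most eps, and return the number of
    correct calls in that prefix divided by the total number of bases."""
    order = sorted(range(len(qualities)), key=lambda i: -qualities[i])
    best = 0
    correct = 0
    errors = 0
    for n, i in enumerate(order, start=1):
        if is_correct[i]:
            correct += 1
        else:
            errors += 1
        if errors / n <= eps:
            best = correct       # this prefix still satisfies the tolerance
    return best / len(qualities)
```

For example, with qualities `[40, 30, 20, 10]` and correctness flags `[True, True, False, True]`, a tolerance of 0 keeps only the two best bases (value 0.5), while a tolerance of 0.25 admits all four bases (value 0.75).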

Effects of improved base-calling accuracy on de novo sequence assembly

In shotgun sequencing, a long target sequence is oversampled by a library of randomly fragmented copies of the target, and the overlaps between short reads obtained by a high-throughput platform are used to assemble the target. In

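The maximum contig length and N50 length of de novo assembly using Velvet. The average values over 200 experiments are shown in the table.

| Coverage | Bustard N50 | Bustard Max | Rolexa N50 | Rolexa Max | naiveBayesCall N50 | naiveBayesCall Max | BayesCall N50 | BayesCall Max | ParticleCall via MCEM N50 | ParticleCall via MCEM Max | ParticleCall via PFEM N50 | ParticleCall via PFEM Max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5X | 271 | 607 | 259 | 565 | 278 | 604 | 292 | 629 | 299 | 637 | 289 | 632 |
| 10X | 1169 | 1750 | 971 | 1557 | 1180 | 1731 | 1269 | 1831 | 1316 | 1900 | 1341 | 1865 |
| 15X | 3624 | 3823 | 2885 | 3170 | 3726 | 3908 | 3466 | 3741 | 3742 | 3935 | 3697 | 3918 |
| 20X | 4694 | 4744 | 4529 | 4614 | 4756 | 4816 | 4827 | 4875 | 5102 | 5116 | 4795 | 5039 |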
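As a reminder of what the N50 column measures, the following Python sketch computes N50 from a list of contig lengths using the usual definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled length. The function name is illustrative, and the snippet is independent of how Velvet itself reports the statistic.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 for an empty input)."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:      # longest contigs now cover at least half the assembly
            return length
    return 0
```

For contigs of lengths 2, 3, 4, 5, and 6 (total 20), the two longest contigs already cover 11 bases, so the N50 is 5.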

Conclusions

In this paper we presented ParticleCall, a particle filtering algorithm for base calling on Illumina’s sequencing-by-synthesis platform. The algorithm is developed by relying on an HMM representation of the sequencing process. Experimental results demonstrate that the ParticleCall base-calling algorithm is more accurate than Bustard, Rolexa, and naiveBayesCall. It is as accurate as BayesCall while being significantly faster. Quality score analysis of the reads indicates that ParticleCall has better discrimination ability than BayesCall, naiveBayesCall, and Bustard. Moreover, a novel particle filter EM (PFEM) parameter estimation scheme, much faster than the existing Monte Carlo implementation of the EM algorithm, was proposed. When relying on the PFEM scheme, ParticleCall comes within 0.0001 of the best observed error rate (0.0125 vs. 0.0124) while requiring far less total time for parameter estimation and base calling (130 vs. 1370 minutes for BayesCall).

Authors’ contributions

Algorithms and experiments were designed by Xiaohu Shen (XS) and Haris Vikalo (HV). Algorithm code was implemented and tested by XS. The manuscript was written by XS and HV. Both authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

This work was funded by the National Institutes of Health under grant 1R21HG006171-01.