Center for Clinical and Research Informatics, Northshore University HealthSystem, Evanston, IL 60091, USA

ICES, University of Texas at Austin, Austin, TX 78705, USA

Department of Statistics, Pontificia Universidad Católica de Chile, Casilla 306, Correo 22, Santiago, Chile

Department of Mathematics, The University of Texas at Austin, Austin, TX 78705, USA

Abbott Molecular Inc., Des Plaines, IL 60018, USA

Department of Leukemia, The University of Texas, M. D. Anderson Cancer Center, Houston, TX 77030, USA

Department of Bioinformatics & Computational Biology, The University of Texas, M. D. Anderson Cancer Center, Houston, TX 77030, USA

Abstract

Base calling is a critical step in the Solexa next-generation sequencing procedure. It compares the position-specific intensity measurements that reflect the signal strength of four possible bases (A, C, G, T) at each genomic position, and outputs estimates of the true sequences for short reads of DNA or RNA. We present a Bayesian method of base calling, BM-BC, for Solexa-GA sequencing data. The Bayesian method builds on a hierarchical model that accounts for three sources of noise in the data, which are known to affect the accuracy of the base calls:

Introduction

Next generation sequencing (NGS) such as Solexa sequencing (

Many challenges remain in processing NGS data. We consider one important problem, namely base calling. Base calling refers to the estimation of the true sequences of DNA or RNA based on intensity scores that measure the signal strength of the four nucleotides, A, C, G, and T. One of the most popular NGS technologies is Solexa/Illumina sequencing, in which the intensity data from a standard run consist of millions of intensity measurements for the four bases of short reads spanning the genome. For each short read, the intensity measurements are stored in an

**Scatter plot.** The panels show scatter plots of the A-C and G-T pairs, constructed from the raw data alone. The y axis and the x axis in the left panel represent the C and A channels, respectively. Similarly, the y and x axes in the right panel denote the T and G channels. The top panel consists of smoothed density plots of A intensities versus C intensities, and G intensities versus T intensities. The four colors in the bottom panel represent the estimated base calls from the proposed BM-BC method: black for A, red for C, green for G, and blue for T. The intensity values shown in the figure are normalized by subtracting the overall minimum intensity and then dividing by the standard deviation.

In summary, the final data are millions of quadruple vectors. Each vector contains four continuous scores that represent the fluorescent intensities of nucleotides A, C, G, and T. Using these data, our task is to estimate the sequence of each short read.
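As an illustration of this data layout, the following minimal Python sketch stores one short read as a list of per-cycle quadruples and applies the normalization described in the scatter plot caption (subtract the overall minimum, then divide by the standard deviation). The toy intensity values, the channel order, the function name `normalize`, and the choice of the population standard deviation are assumptions made for illustration; none of them is taken from the BM-BC software.

```python
from statistics import pstdev

BASES = "ACGT"  # channel order assumed for this sketch


def normalize(read):
    """Shift by the overall minimum and scale by the overall standard deviation."""
    values = [v for cycle in read for v in cycle]
    lo = min(values)
    sd = pstdev(values)  # population SD; an assumption of this sketch
    return [[(v - lo) / sd for v in cycle] for cycle in read]


# A toy read with three cycles; each row holds the (A, C, G, T) intensities.
toy_read = [
    [900.0, 400.0, 20.0, 10.0],
    [30.0, 15.0, 700.0, 350.0],
    [25.0, 800.0, 40.0, 35.0],
]

normalized = normalize(toy_read)
```

After normalization the smallest value is zero and the pooled values have unit standard deviation, which puts the four channels on a common scale for plotting.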

We acknowledge that the method proposed in this paper deals with data from the Solexa Genome Analyzer. Solexa/Illumina has since developed newer sequencing technologies, such as the HiSeq series. However, numerous data sets have already been generated using the Genome Analyzer, and these need to be properly analyzed. We believe that our base-calling approach will contribute to the analysis of existing data and of future data from experiments that still use the Genome Analyzer. To our knowledge, a few methods for base calling are available in the literature. Most researchers use the default procedure, Bustard, built into the commercial software of the Illumina Genome Analyzer. The procedure yields an estimated base for each cycle along with a quality score called fast-q. The fast-q score measures the most likely base intensity relative to the three other intensities on a logarithmic scale from –5 to 40. In practice, DNA tags with small fast-q scores are discarded in Solexa base calling. A more recent statistical method of base calling is by

In this paper, we propose a model-based Bayesian method of base calling (BM-BC) for Solexa sequencing data. The BM-BC method uses a hierarchical model that supports probability-based inference for base calling. Model parameters are estimated via Markov chain Monte Carlo (MCMC) simulations, and the posterior samples are used to compute the probability that each base is A, C, G, or T. These posterior probabilities are used to estimate the true DNA sequences, to rank the base calls, and to compute false discovery rates (FDR). The remainder of this paper is organized as follows. The Methodology section presents a probability model for base calling and the posterior inference procedure. The Numerical examples section presents the base-calling results for a Solexa sequencing data set using the BM-BC method and three other methods for comparison. The Discussion section concludes the paper.

Methodology

To start, we introduce the three known sources of noise in the Solexa data that motivated the proposed probability models. The first type of noise is called

**Error rates for a random subsample of 1000 clusters**. (Colored figure) The error rate for each cycle. The error rate of the Solexa calls has a large increase after cycle 26, while the error rates of the BM-BC, B-I, and Rolexa calls increase gradually over the cycles.

Other important systematic biases also affect the accuracy of base calling. For a discussion, see

Hierarchical models

We first consider models for sequence data of a single colony, i.e., measurements corresponding to a short read, with say ** y** = {

Let _{i}_{i}_{4}(** µ**, Σ) denote a 4-dimensional normal with mean vector

and

where _{j}^{λ}_{i}_{–1},_{j}

The cross talk is accounted for by constructing appropriate priors for _{j}

When the true base is A (i.e., _{11} and _{12} while the intensities at channels T and G will be close to zero, parametrized as _{11} and _{12}. In addition, the mean intensity _{11} at channel A should be larger than _{12} at channel C. Therefore, the prior for _{1} is given by

We use a log N(0,1) prior for _{11}_{1} accounts for the cross talk from channel C to channel A. We assign a _{ε}_{1} and _{ε}_{2}, we use

The model is completed by specifying the discrete uniform prior for _{i}_{i}_{j}

The models above are built for one colony of sequencing data. With multiple colonies, we use _{ic}_{ic,1}, …, _{ic,4}) to denote the quadruple intensities of cycle _{ic}_{ic}_{ic}_{ic}_{i}_{i}_{ic}_{j}, λ, α_{j}_{ic}_{ic}

Posterior inference

Inference is carried out via MCMC simulations. The probability models are coded in C (now included in an R package). The MCMC output provides Monte Carlo samples of all the parameters from the joint posterior distribution, which can be used to perform posterior inference. For example, we obtain random samples of _{ic}

as the posterior probability that the _{ic}
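The step from posterior samples to base-call probabilities can be sketched as follows. The sketch assumes the usual Monte Carlo estimator, in which the posterior probability of each base at a given cycle and colony is the relative frequency of that base among the retained MCMC draws; the exact expression used in the paper is not reproduced here. Calling the base with the largest posterior probability is likewise an assumption of this sketch, and the function names and toy draws are hypothetical.

```python
from collections import Counter

BASES = "ACGT"


def posterior_probabilities(draws):
    """Relative frequency of each base among the retained MCMC draws."""
    counts = Counter(draws)
    n = len(draws)
    return {b: counts[b] / n for b in BASES}


def call_base(probs):
    """Return the base with the highest posterior probability, and that probability."""
    best = max(BASES, key=lambda b: probs[b])
    return best, probs[best]


# Toy draws of the true base at one cycle of one colony (hypothetical values).
toy_draws = ["A"] * 7 + ["C"] * 2 + ["G"]
probs = posterior_probabilities(toy_draws)
base, confidence = call_base(probs)
```

The probability attached to the called base can then serve as a per-call confidence measure when ranking calls.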

Numerical examples

We compared the performance of the BM-BC method with currently leading methods, including the Solexa Bustard, the Rolexa method

Data

We obtained Solexa DNA sequencing data from the control lane for a bacteriophage. Sequencing a control lane is part of the standard Solexa protocol. To illustrate the performance of the base-calling methods, we randomly selected three subsets, each containing 1,000 colonies of the sequence data.

The control lane sequences the genome of an enterobacteria phage, phiX174, which consists of 5,386 bases of single-stranded DNA and has no polymorphism. DNA preparation follows the Illumina Control DNA library protocol (Illumina Cat. No CT-901-1001). The DNA is broken into fragments of 200 nucleotides and subjected to 18 cycles of polymerase chain reaction (PCR) amplification before DNA colonies are generated by single-molecule PCR. The sequences of the DNA colonies are probed by 36 cycles of sequencing by synthesis.

Each DNA read is compared to the entire phage genome of 5,386 positions to search for the best matches. This is done using the Solexa software PhageAlign. After a tag is aligned to the phage genome, the matched sequence on the phage genome is considered to be the true sequence and any mismatched nucleotide is considered a sequencing error. The assignment of the true sequence is correct because 1) the phage genome contains no polymorphism and 2) the small genome size makes a mistaken sequence match over 36 nucleotides highly unlikely. Note that this is not the case for the human genome, where polymorphism occurs (

Analysis with random subsets

We first applied all the methods to a small data set for illustration. We then applied the BM-BC method to a data set from the control lane of a Solexa sequencing run, consisting of about 5 million short reads. We compared the following four base-calling methods using the phage sequencing data.

• Bustard from Solexa’s Genome Analyzer: this is the commercial software provided by Illumina. More detailed information about the Genome Analyzer can be found at

• Rolexa: this is a method building upon model-based clustering

• B-I: this is the intensity model proposed in Bravo and Irizarry (2010). The authors carefully examined potential sources of noise in the intensity data and proposed a linear mixture model with different means given the indicator of the true base. They applied the EM algorithm to obtain the posterior probabilities of the true base calls. See

• BM-BC: our proposed method.

We applied all four methods to the three random subsets of phage sequencing data, each with 1,000 colonies.

For the BM-BC method, we performed base calling using 100 colonies at a time. The Markov chains converged quickly and mixed extremely well. For every 100 colonies, we ran a total of 600 iterations and discarded only the first 100 as burn-in.
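The batching and burn-in scheme described above can be sketched in a few lines of Python. The helper names `batches` and `discard_burn_in` are hypothetical, and the chain is a stand-in list of iteration indices; a real sampler would store parameter draws instead.

```python
def batches(n_colonies, batch_size=100):
    """Yield (start, stop) index pairs covering all colonies in order."""
    for start in range(0, n_colonies, batch_size):
        yield start, min(start + batch_size, n_colonies)


def discard_burn_in(chain, burn_in=100):
    """Drop the first `burn_in` iterations of a chain."""
    return chain[burn_in:]


# 1,000 colonies processed 100 at a time.
plan = list(batches(1000))

# Stand-in for a chain of 600 iterations.
toy_chain = list(range(600))
kept = discard_burn_in(toy_chain)
```

With these settings each subset of 1,000 colonies is handled in 10 batches, and each batch retains 500 of its 600 iterations for posterior inference.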

We compared the estimated bases from the four methods with the true bases. Table

Error rates for different methods under comparison

Number of wrong calls (percentage), by data set and method:

| Data set | BM-BC | Solexa | B-I | Rolexa |
| --- | --- | --- | --- | --- |
| 1 | **1,340 (3.7%)** | 1,455 (4.0%) | 1,428 (4.0%) | 1,601 (4.4%) |
| 2 | **1,354 (3.7%)** | 1,514 (4.2%) | 1,426 (4.0%) | 1,432 (4.0%) |
| 3 | 1,385 (3.8%) | 1,438 (4.0%) | 1,444 (4.0%) | **1,345 (3.7%)** |

The number of wrong calls for the methods under comparison: the proposed BM-BC, Solexa calls from the Bustard method, the method in Bravo and Irizarry (2010) (B-I), and the Rolexa method. Three subsets of Solexa sequencing data for a bacterial phage were selected, each with 1,000 colonies. Each row contains the number of wrong calls (out of 36,000) for a subset. The bold entry in each row indicates the method with the fewest wrong calls.
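Counts such as those in the table can be obtained by comparing each called read with its aligned reference sequence, position by position. The following is a minimal sketch under the assumption that called and true reads are equal-length strings in matching order; the function name and toy reads are hypothetical. With 1,000 colonies of 36 cycles each, the denominator for the overall rate would be 36,000.

```python
def count_wrong_calls(called_reads, true_reads):
    """Total mismatches and per-cycle error rates for equal-length reads."""
    n_cycles = len(true_reads[0])
    per_cycle = [0] * n_cycles
    for called, truth in zip(called_reads, true_reads):
        for i, (c, t) in enumerate(zip(called, truth)):
            if c != t:
                per_cycle[i] += 1
    total = sum(per_cycle)
    n_reads = len(true_reads)
    rates = [errors / n_reads for errors in per_cycle]
    return total, rates


# Toy data: three reads of four cycles each (hypothetical sequences).
toy_true = ["ACGT", "TTGA", "CCAG"]
toy_called = ["ACGT", "TTGC", "CAAG"]

total, rates = count_wrong_calls(toy_called, toy_true)
overall_rate = total / (len(toy_true) * len(toy_true[0]))
```

The per-cycle rates returned here are the quantities plotted against cycle number in the error-rate figures.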

In Table

For ease of exposition, we now focus on the results of an arbitrary subset, data set 1 in Table

The BM-BC method is more likely than the other three methods to make correct calls for a given colony. In extreme cases, the BM-BC method makes over 20 more correct calls (out of a total of 36) than the other methods. In contrast, the largest excess of wrong calls made by the BM-BC method for a colony is only 6. Figure ** of the BM-BC method. The idea is to treat **

1. Let the true base be _{ic}

2. Compute

3. Rank the pairs (

4. Starting from the highest ranking pair (

Figure _{ic}_{ic}

**FDR plot**. Bayesian FDR plot with 18,000 base calls under the BM-BC method.
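A curve of this kind can be computed from the posterior probabilities of the base calls. The sketch below uses a common posterior expected FDR construction, stated here as an assumption rather than as the exact procedure of the paper: calls are ranked from most to least certain, and the estimated FDR among the top k calls is the average of one minus the posterior probability of each call. The function names and toy probabilities are hypothetical.

```python
def bayesian_fdr_curve(call_probs):
    """Expected FDR among the top-k calls, for k = 1..n.

    call_probs holds, for each base call, the posterior probability that
    the call is correct. Calls are ranked from most to least certain.
    """
    ranked = sorted(call_probs, reverse=True)
    curve = []
    expected_errors = 0.0
    for k, p in enumerate(ranked, start=1):
        expected_errors += 1.0 - p
        curve.append(expected_errors / k)
    return ranked, curve


def n_calls_within(call_probs, target_fdr):
    """Largest k whose expected FDR among the top-k calls is at most target_fdr."""
    _, curve = bayesian_fdr_curve(call_probs)
    best = 0
    for k, fdr in enumerate(curve, start=1):
        if fdr <= target_fdr:
            best = k
    return best


# Toy posterior probabilities for five base calls (hypothetical values).
toy_probs = [0.99, 0.60, 0.95, 0.80, 0.999]
ranked, curve = bayesian_fdr_curve(toy_probs)
```

Because the calls are sorted by decreasing certainty, the curve is non-decreasing in k, so a target FDR translates directly into a number of calls to retain.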

Full data analysis

We applied the BM-BC method to a data set consisting of 5,120,000 colonies. The data are from a control lane in a standard Solexa run, in which the true sequences are known. We first split the data into 8 equal parts, each comprising 640,000 colonies. We then applied the BM-BC method to each of the eight subsets in parallel. The eight jobs were executed on an iMac with a 2.8 GHz Intel Core i7 processor and 16 GB of memory. The computation took about 4 hours. We have built an R package, “BM-BC”, available for download from

We computed

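Basecall Matching Rates

| True call | Predicted A | Predicted C | Predicted G | Predicted T |
| --- | --- | --- | --- | --- |
| A | 97.22 | 2.00 | 0.3 | 0.3 |
| C | 1.06 | 95.75 | 1.29 | 1.88 |
| G | 0.00 | 0.00 | 92.89 | 6.33 |
| T | 0.00 | 0.01 | 0.00 | 98.52 |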

Matching rates of base calls, in percent. The overall matching percentage is 96.24.
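A matching-rate table of this form can be computed by cross-tabulating true and predicted calls and normalizing each row. The sketch below assumes that rows are indexed by the true base, so that each row sums to 100 percent, and that the overall matching percentage is the share of all calls that agree with the truth; the function name and toy sequences are hypothetical.

```python
BASES = "ACGT"


def matching_rates(true_calls, predicted_calls):
    """Row-normalized confusion matrix in percent, plus the overall matching percentage."""
    counts = {t: {p: 0 for p in BASES} for t in BASES}
    for t, p in zip(true_calls, predicted_calls):
        counts[t][p] += 1
    rates = {}
    for t in BASES:
        row_total = sum(counts[t].values())
        rates[t] = {
            p: (100.0 * counts[t][p] / row_total if row_total else 0.0)
            for p in BASES
        }
    matches = sum(counts[b][b] for b in BASES)
    overall = 100.0 * matches / len(true_calls)
    return rates, overall


# Toy data: 16 calls, four per true base (hypothetical sequences).
toy_true = "AAAACCCCGGGGTTTT"
toy_pred = "AAACCCCCGGGTTTTT"
rates, overall = matching_rates(toy_true, toy_pred)
```

Under this definition the overall percentage is weighted by how often each base occurs, so it need not equal the simple average of the four diagonal entries.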

**Comparison with B-I method.** Comparison of base errors per cycle for the BM-BC method (right panel) and the B-I method (left panel) of Bravo and Irizarry (2010) for a random subset of 50,000 colonies. The error rate of base calls is about 4.9% for the BM-BC method and about 8.0% for the B-I method. The G-T substitution error curve (turquoise green solid line) and the A-C substitution curve (blue dotted line) dominate the other pairwise substitution rates in both methods. However, the BM-BC curves are clearly lower both in absolute scale and in rate of increase over cycles.

**Comparison with Solexa method.** (Colored figure) Base errors per cycle for the entire data set based on the BM-BC method (top panel) and Bustard under Solexa sequencing (bottom panel). The plot further confirms that, for the BM-BC method, base substitution errors do not increase with cycle number, a common problem in most base-calling methods. The major potential substitution errors, A-C and G-T substitutions, are also accounted for quite well. For the Bustard method, there is a large increase in the error rate for A-C substitutions after cycle 26 (shown by the green dotted line). Both methods yield an overall error rate of 4% in base calling.

Discussion

An important feature of the BM-BC method is that it yields marginal posterior probabilities of the four nucleotides for each base. This allows a full probability-based inference for base calling and subsequent analysis. For example, one can associate the posterior probability of the base call with the estimated base and use it as a quality control measure for downstream sequence alignment. Sequences mapped to a genome with overall high posterior probabilities are more reliable than those with lower probabilities.

We also compared our method with the Bayesian classifier BayesCall in

We acknowledge that there is scope for improving the model by incorporating error sources unique to the latest sequencing platforms.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Conceived and designed the method: YJ FQ AJ SL. Performed the data analysis: YJ RM FQ AJ PL. Wrote the paper: YJ RM FQ PM YL SL.

Acknowledgement

Yuan Ji’s and Peter Müller’s research is partly supported by NIH/NCI R01 CA132897. Shoudan Liang’s research is partly supported by NIH/NCI K25 CA123344. Fernando Quintana’s research is partly supported by grant FONDECYT 1100010.

This article has been published as part of