Abstract
Base calling is a critical step in the Solexa nextgeneration sequencing procedure. It compares the positionspecific intensity measurements that reflect the signal strength of four possible bases (A, C, G, T) at each genomic position, and outputs estimates of the true sequences for short reads of DNA or RNA. We present a Bayesian method of base calling, BMBC, for SolexaGA sequencing data. The Bayesian method builds on a hierarchical model that accounts for three sources of noise in the data, which are known to affect the accuracy of the base calls: fading, phasing, and crosstalk between channels. We show that the new method improves the precision of base calling compared with currently leading methods. Furthermore, the proposed method provides a probability score that measures the confidence of each base call. This probability score can be used to estimate the false discovery rate of the base calling or to rank the precision of the estimated DNA sequences, which in turn can be useful for downstream analysis such as sequence alignment.
Introduction
Next generation sequencing (NGS) such as Solexa sequencing (http://www.illumina.com webcite) is a powerful tool producing massive sequences of short reads. It is considered the “digital” version of the classic microarray technology because in principle it measures the exact number of gene copies rather than relative abundances. NGS can be used for studies of sequence variations in genomes ([1,2]), proteinDNA interactions ([3,4]), transcriptome analysis ([57]), and de novo genome assembly [8]. The full potential of the technology is still being explored as quantitative researchers try to find efficient ways to streamline the sample processing and model the processed data.
Many challenges remain in processing NGS data. We consider one of the important problems, namely base calling. Base calling refers to the estimation of the true sequences of DNA or RNA based on the intensity scores measuring the signal strength of four nucleotides, A, C, G, and T. One of the most popular NGS technology is the Solexa/Illumina sequencing, in which intensity data from a standard run consist of millions of intensity measurements for the four bases of short reads spanning across the genome. For each short read, the measurements of their intensities are stored in an I × 4 matrix, where I is the length of the read (e.g., I = 36). Such a matrix corresponds to a colony. The positions i = 1, ..., I in the short read are sequenced in cycles. As a result, each row of the colony matrix contains measurements from a cycle in the experiment in which the sequence of a single base is synthesized. At each cycle, all four nucleotides (A, C, G, and T) labeled with four different fluorescent dyes are probed, thus producing a quadruple vector of fluorescent intensity scores. Figure 1 plots the A intensities versus the C intensities (top left panel) and the G intensities versus the T intensities (top right panel) for 1,000 arbitrarily chosen colonies. The four colors used in the bottom two panels represent the estimated base calls from the proposed BMBC method. Figure 1 exhibits two main features. First, the A and C intensities are highly correlated as are the G and T intensities, which is known as the “cross talk” between channels [9]. Second, when the A or C intensity is large, both the G and T intensities are small; similarly, when G or T is large, both A and C are small.
Figure 1. Scatter plot. The panel shows the scatter plots of the AC and GT pairs, constructed from the raw data alone. The y axis and the x axis in the left panel represent the C and A channels respectively. Similarly, the y and the x axes in the right panel denotes the T and G channels. The top panel consists of smoothed density plots of A intensities versus C intensities, and G intensities versus T intensities. The four colors in the figures of the bottom panel represent the estimated base calls from the proposed BMBC method: black A, red  Cgreen G, blueT. The intensity values shown in the figure are normalized by subtracting from the overall minimum intensity and then dividing by the standard deviation.
In summary, the final data are millions of quadruple vectors. Each vector contains four continuous scores that represent the fluorescent intensities of nucleotides A, C, G, and T. Using these data, our task is to estimate the sequence of each short read.
We acknowledge that the proposed method in this paper deals with the data from Solexa genome analyzer. New sequencing technologies have been developed by Solexa/Illumina, such as the HiSeq series. However, numerous data sets have already been generated using the genome analyzer, which need to be properly analyzed. We believe that our proposed basecalling approach will contribute to the analysis of the existing data and also future data from experiments that still use the genome analyzer for sequencing. To our knowledge, a few methods for base calling are available in the literature. Most researchers use the default procedure, Bustard, built into the commercial software of the Illumina Genome Analyzer. The procedure yields an estimated base for each cycle along with a quality score called fastq. The fastq score measures the most likely base intensity relative to the three other intensities on a logarithmic scale from –5 to 40. In practice, DNA tags with small fastq scores are discarded in Solexa base calling. A more recent statistical method of base calling is by [10], who considered a variety of issues in the sequencing data including the base calling. Other works include [11,12]. A recent addition to this group of methods is Ibis (Improved Base Calling for Genome Sequence Analyzer) [13]. Ibis applies multiclass Support Vector Machines to raw cluster intensities. The model is trained from data obtained from a reference genome.
In this paper, we propose a modelbased Bayesian method of basecalling (BMBC) for Solexa sequencing data. The BMBC method presents a hierarchical model that applies a probabilisticbased inference for base calling. The estimation of model parameters is computed via Markov chain Monte Carlo (MCMC) simulations and the posterior samples are used to compute the probability that each base is A, C, G, or T. These posterior probabilities are used to estimate the true DNA sequences, to rank the base calls, and to compute the false discovery rates (FDR). The remainder of this paper is organized as follows: The Methodology section presents a probability model for base calling, and the posterior inference procedure. The section on Numerical examples presents the basecalling results for a Solexa sequencing data set using the BMBC method and three other methods as comparison. The Discussion sections ends the paper.
Methodology
To start, we introduce the three known sources of noise in the Solexa data that motivated the proposed probability models. The first type of noise is called fading (see e.g., [10]), which refers to a decay in the intensity as a function of cycle number. That is, for a colony, as the cycle number increases, the intensity measurement decreases. This is usually caused by material loss during the sequencing process. The second source is phasing, a wellknown source of noise in Solexa sequencers that use cyclic reversible termination (CRT) ([14,15]). Basically, errors in the CRT cause stochastic failures in basebinding that is supposed to incorporate only one nucleotide per cycle. Instead, the errors may lead to incorporation of none or more than one nucleotide in one cycle, thus increasing the noise in the signal output for downstream cycles. As a result, the precision of base calling drops as the cycle number increases (see Figure 2). The third important source of noise is a fluorophore cross talk between channels A and C, and channels G and T. The cross talk induces high correlations between A intensities and C intensities, and between G intensities and T intensities (see Figure 1). There are many factors that contribute to crosstalk between channels, one of them being an overlap in the wavelengths of the dye schemes used to mark different nucleotides.
Figure 2. Error rates for a random subsample of 1000 clusters. (Colored figure) The error rate for each cycle. The error rate of the Solexa calls has a large increase after cycle 26, while the error rates of the BMBC, BI, and Rolexa calls increase gradually over the cycles.
Other important systematic biases also affect the accuracy of base calling. For a discussion, see [14,15]. However, these biases can be removed or reduced using standard statistical techniques. We assume that these biases have been removed and now the goal is to model the intensity scores.
Hierarchical models
We first consider models for sequence data of a single colony, i.e., measurements corresponding to a short read, with say I = 36 bases. Let y = {y_{1}, …, y_{36}} represent the 36 quadruplets of nucleotide intensity measurements, where y_{i} = (y_{i}_{1}, y_{i}_{2}, y_{i}_{3}, y_{i}_{4})′ is the 4 × 1 vector for cycle i, respectively representing the intensities of four nucleotides, A, C, G, and T at location i of the short reads. Therefore, strong signals are indicated by large positive values of y_{ij}. Because for each cycle only one true nucleotide is present, ideally only one of the four y_{ij}’s should be positive and the remaining three should be zero. In the presence of noise, this is not the case. First, due to channel crosstalk, y_{i}_{1} and y_{i}_{2} are positively correlated, as are y_{i}_{3} and y_{i}_{4}. Second, because of fading, the intensities decay over cycles; that is, for later cycles, the values of y_{i}’s are smaller on average. Last, when phasing is present, the intensity scores at cycle i depend on the ones at cycle (i – 1).
Let k_{i} ∈ {1, 2, 3,4} indicates the true base of cycle i, where {1, 2, 3, 4} correspond to {A, C, G, T}. The main feature of the sampling model for y_{i} is given by an autoregression consisting of a mixture of four multivariate normal distributions, with each normal distribution describing the case when the true base is one of {A, C, G, T}. Specifically, letting MVN_{4}(µ, Σ) denote a 4dimensional normal with mean vector µ and covariance matrix Σ, we assume that for i = 2, …, 36,
and
where I_{j}’s are four indicator functions Ind(·) that truncate the multivariate normal. Here, . These indicators reflect the prior belief that the true base should have the largest intensity. Models (2) and (2) attempt to account for three sources of noise in the data. Specifically, due to fading, the intensity signals weaken as the cycle indicator i gets larger. Therefore, we include the exponential factor exp(–β · i^{λ}) to describe the decay of the mean signal. Note that we specify an exponent λ to allow for more flexibility. For the phasing, we add a term α · y_{i}_{–1},_{j} to the mean of the multivariate normal (thus autoregressive), i.e., the intensity of the current cycle i depends on the intensity of the previous cycle (i – 1) for i ≥ 2.
The cross talk is accounted for by constructing appropriate priors for μ_{j}’s, as described next. We assume that the mean intensities when the true base is A, C, G, or T are given by
When the true base is A (i.e., j = 1), the intensities at channels A and C are modeled by µ_{11} and µ_{12} while the intensities at channels T and G will be close to zero, parametrized as ε_{11} and ε_{12}. In addition, the mean intensity µ_{11} at channel A should be larger than µ_{12} at channel C. Therefore, the prior for µ_{1} is given by
We use a log N(0,1) prior for µ_{11}. Here, g_{1} accounts for the cross talk from channel C to channel A. We assign a beta(1, 1) as its prior. For g_{ε}_{1} and g_{ε}_{2}, we use beta(2,10) to reflect our strong belief that the intensities at channels G and T are much smaller than the intensity at channel A. We have tried other beta priors beta(a, b) with a ≪ b and obtained similar results in base calling.
The model is completed by specifying the discrete uniform prior for k_{i}, i.e., Pr(k_{i} = j) = 1/4 for j = 1, 2, 3, 4, a beta(1, 1) prior for λ, α, and β, and an inverse Wishart(diag(1, 4), 6) prior for ∑_{j}, where diag(1, 4) is the 4 × 4 identity matrix.
The models above are built for one colony of sequencing data. With multiple colonies, we use y_{ic} = (y_{ic,1}, …, y_{ic,4}) to denote the quadruple intensities of cycle i in colony c, and k_{ic} to represent the latent indicator of the true base of cycle i in colony c. The models for y_{ic} are the same as in (2) and (2), with y_{ic} and k_{ic} replacing y_{i} and k_{i}. The priors for k_{ic}, µ_{j}, λ, α, β, and ∑_{j} remain unchanged. Since y_{ic}’s are conditionally independent, the joint likelihood for all the data is simply the product of the likelihood function for each y_{ic}. For simplicity, the mathematical expression of the models is omitted.
Posterior inference
Inference is carried out via MCMC simulations. The probability models are coded in C (now included in an R package). The MCMC simulations output provides Monte Carlo posterior samples of all the parameters from the joint posterior distribution. These samples can be used to perform posterior inference. For example, we obtain random samples of k_{ic} from its marginal posterior, denoted as , where B is the number of MCMC samples. We can compute
as the posterior probability that the ith cycle in colony c has a true base of A, C, G, or T, respectively. These samples can be used to perform base calling. Specifically, the Bayesian base call corresponds to the nucleotide with the largest posterior probability in its cycle. That is, we assign base A, C, G, or T to cycle i in colony c if s_{ic} equals 1, 2, 3, or 4, where . In addition, one can assess the accuracy of the proposed method by computing an estimated Bayesian FDR ([16,17]) using the ξ’s. We will demonstrate this feature with a concrete example in the next section.
Numerical examples
We compared the performance of the BMBC method with currently leading methods, including the Solexa Bustard, the Rolexa method [11], and the BI method [10].
Data
We obtained Solexa DNA sequencing data from the control lane for a bacteria phage. This is part of the standard Solexa protocol. To illustrate the performance of base calling methods, we randomly selected three subsets, with each containing 1,000 colonies of the sequence data.
The control lane sequences the genome of an enterobacteria phage, phiX174, which is composed of 5,386 bases of single stranded DNA sequences and has no polymorphism. DNA preparation follows Illumina Control DNA library protocol (Illumina Cat. No CT9011001). DNA are broken to a size of 200 nucleotides and are subject to 18 cycles of polymerase chain reaction (PCR) amplification before the generation of DNA colonies by single molecule PCR. The sequences of DNA colonies are probed by 36 cycles of sequencing by synthesis.
Each DNA read is compared to the entire phage genome of 5,386 positions to search for the best matches. This is done using the Solexa software PhageAlign. After a tag is aligned to the phage genome, the matched sequence on the phage genome is considered to be the true sequence and any mismatched nucleotide is considered a sequencing error. The assignment of the true sequence is correct because 1) the phage genome contains no polymorphism and 2) the small genome size makes a mistaken sequence match over 36 nucleotides highly unlikely. Note that this is not the case for the human genome, where polymorphism occurs ([18]). Here, we treat the bases obtained from the above procedure as the “true” ones and compare the performance of base calling methods based on the deviation from these bases.
Analysis with random subsets
We first applied all the methods to a small data set for illustration purpose. We then implemented the BMBC method on a data set from the control lane of the Solexa sequencing, consisting of about 5 million short reads. We compare the following four basecalling methods using the phage sequencing data.
• Bustard from Solexa’s Genome Analyzer: this is the commercial software provided by Illumina. More detailed information about the Genome Analyzer can be found at http://www.illumina.com webcite.
• Rolexa: this is a method building upon modelbased clustering [11], which assumes that the quadruplets of intensities follow fourcomponent univariate Gaussian mixture models. Instead of performing a full Bayesian inference using the joint posterior distribution, the Rolexa method applies the EM algorithm to obtain point estimates of the parameters.
• BI: this is the intensity model proposed in Bravo and Irrizary (2010). The authors carefully examined potential noises in the intensity data and proposed a linear mixture model with different means given the indicator of true bases. They applied the EM algorithm to obtain the posterior probabilities of the true base calls. See [10].
• BMBC: our proposed method.
We applied all four methods to the three random subsets of phage sequencing data, each with 1,000 colonies.
For the BMBC method, we performed base calling using 100 colonies at a time. The Markov chains converged fast and mixed extremely well. We only needed to throw away 100 burnin samples with a total of 600 iterations for every 100 colonies.
We compared the estimated bases from the four methods with the true bases. Table 1 shows the number of wrong calls for each of the four methods. The BMBC method had the smallest number of wrong calls for two subsets and a close second for the third subset, in which the Rolexa yields the smallest number of wrong calls.
Table 1. Error rates for different methods under comparison
In Table 1, we used ACGT as the base calls for the Rolexa method. In the original paper by Rougemont et al. (2008), the authors focused on using the International Union of Pure and Applied Chemistry (IUPAC) symbols (http://www.bioinformatics.org/sms/iupac.html webcite) as base calls. These symbols include not only ACGT, but other ambiguous calls that represent more than one base within ACGT. The authors stated that the IUPAC symbols gave the Rolexa better performance. For a fair comparison, we used the ACGT symbols for the Rolexa.
For ease of exposition, we now focus on the results of an arbitrary subset, data set 1 in Table 1. We computed the difference in the number of correct calls per colony between the BMBC method and each of the other three methods.
We can see that the BMBC method is more likely to make right calls for a given colony than the other three methods. In addition, in extreme cases the BMBC method could make more than 20 more correct calls (out of a total of 36) than the other methods. In contrast, the largest number of more wrong calls the BMBC method could make is only 6. Figure 2 compares the error rates by cycle, defined as the proportion of wrong calls for each cycle across all colonies. Interestingly, the error rate for the Solexa calls has a large increase after cycle 26. See Figure 5 for more results related to this. This seems to suggest that the Solexa base calling is more sensitive to the phasing noise in the data. In contrast, the error rates for the other three methods increase gradually over the cycles. Both BMBC and Rolexa methods are also robust to phasing as it is specifically accounted for in the probability models. We can estimate the FDR based on the posterior probabilities ξ’s for base calls from the BMBC method. Because we know the true bases, we can precisely compute the FDR of the BMBC method. The idea is to treat as the local FDR. We present the following algorithm for computing the FDR based on the true bases.
1. Let the true base be t_{ic} for cycle i in colony c.
2. Compute ; then is the local FDR denoting the posterior probability of making a wrong call.
3. Rank the pairs (i, c) according to the increasing values of .
4. Starting from the highest ranking pair (i, c) with the smallest , move down to the Gth highest ranking pair. The estimated FDR is given by the sum of for all G pairs divided by G.
Figure 3 plots the estimated FDR versus the number of calls (ranked based on increasing values of ). We can see that the FDR is controlled by 0.04. This seems to agree with the error rate in Table 1. In cases where we do not know the true base calls, we only need to replace t_{ic} with , the estimated base call by the BMBC, in the above FDR algorithm to estimate the Bayesian FDR. This new value will be smaller because the errors in s_{ic} are not accounted for.
Figure 3. FDR plot. Bayesian FDR plot with 18,000 base calls under the BMBC method.
Full data analysis
We implemented the BMBC method on a data set consisting of 5,120,000 colonies. The data are from a control lane in a standard Solexa run, in which the true sequences are known. We first splitted the data into 8 equal parts, each comprising of 640,000 colonies. We then applied the BMBC method to each of the eight subsets in parallel. The eight jobs were executed on an iMAC with 2.8 GHz Intel Core i7 and 16 GB of memory. It took about 4 hours to complete the computation. We have built an R package “BMBC”, available to be downloaded from http://odin.mdacc.tmc.edu/~yuanj/soft.html webcite
We computed as the posterior probability that the base of cycle i in colony c is j, for j = 1, 2, 3, or 4. The base call is , the base with the largest posterior probability. We found that almost all the largest posterior probabilities were greater than 0.95, thus implying that our model was able to predict most bases with high degrees of confidence. Since we knew the true sequences for the data, we compared our predicted calls to the true sequences. Table 2 crosstabulates the comparison results. In Figure 4, we see that the BI error curves, though showing no such drastic jumps, still fares poorly compared to the BMBC method. For this dataset the BI also has a larger overall error rate of 8% compared to that of BMBC, which has an overall error rate of of 5%. Figure 5 plots the error rates by cycle for the BMBC and Solexa methods using the entire dataset. Although the overall error rates for the BMBC and the Solexa methods are comparable, the AC substitution rate for the Solexa calls show a large increase after cycle 26.
Table 2. Basecall Matching Rates
Figure 4. Comparison with Bl method. Comparison of Base errors per cycle for the BMBC method (right panel) and the BI method (left panel) in Bravo and Irizarry (2010) for a random subset of 50,000 colonies. The error rate of base calls is about 4.9% for the BMBC and about 8.0% for the BI method. The GT substitution error curve (shown by a turquoise green solid line) and the AC substitution curve (shown by a blue dotted line) dominates the other pairwise substitution rate in both the methods. However, clearly, the curves in the BMBC are lower both in the absolute scales and in the rate of increase with cycles.
Figure 5. Comparison with Solexa method. (Colored figure) Base errors per cycle for the entire dataset based on the BMBC (top panel) and the Bustard under Solexa sequencing (bottom panel). The plot further confirms that for the BMBC method, there is no increase in base substitution errors with increasing cycle, a common problem in most basecalling methods. Also the major potential substitution errors, AC and GT substitutions have been accounted for quite well. For the Bustard method, there is a large increase in the error rates (after cycle 26, shown by the green dotted line) for AC substitutions. Both methods yield an overall error rate of 4% in base calling.
Discussion
An important feature of the BMBC method is that it yields marginal posterior probabilities of the four nucleotides for each base. This allows a full probabilitybased inference for base calling and subsequent analysis. For example, one can associate the posterior probability of the base call with the estimated base and use it as a quality control measure for downstream sequence alignment. Sequences mapped to a genome with overall high posterior probabilities are more reliable than those with lower probabilities.
We also compared our method with the Bayesian classifier BayesCall in [12]. The computation was slow compared to the other methods. The slow speed could be a potential shortcoming for its application to data from NGS platforms, typical consisting of about millions of clusters. Naive Bayes classifiers, on the other hand, suffer from the simplistic assumptions of independence which are grossly violated in datasets of these type. One important feature of BMBC is that it does not require any prior learning for its application to GAI data. However, unsupervised clustering is not always feasible for data from newer sequencing technologies. Ibis [13] specifically uses large training data sets to analyze GAII control lanes. In addition, certain platforms possess unique features and need algorithms specially tailored to their specific requirements. Ibis, for example, is designed to model the features of bidirectional phasing and T accumulation which are present in GAII. On the other hand, BMBC is more suited towards addressing the issues of phasing, fading and cross talk that arise in the context of modeling GAI data.
We acknowledge that there is a scope of improving the model by incorporating the error sources unique to the latest sequencing platforms.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
Conceived and designed the method: YJ FQ AJ SL. Performed the data analysis: YJ RM FQ AJ PL. Wrote the paper: YJ RM FQ PM YL SL.
Acknowledgement
Yuan Ji’s and Peter Müller’s research is partly supported by NIH/NCI R01 CA132897. Shoudan Liang’s research is partly supported by NIH/NCI K25 CA123344. Fernando Quintana’s research is partly supported by grants FONDECYT 1100010.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 13, 2012: Selected articles from The 8th Annual Biotechnology and Bioinformatics Symposium (BIOT2011). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/13/S13/S1
References

Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, Kim PM, Palejev D, Carriero NJ, Du L, Taillon BE, Chen Z, Tanzer A, Saunders AC, Chi J, Yang F, Carter NP, Hurles ME, Weissman SM, Harkins TT, Gerstein MB, Egholm M, Snyder M: Pairedend mapping reveals extensive structural variation in the human genome. [http://www.hubmed.org/fulltext.cgi?uids=17901297] webcite
Science 2007, 318(5849):420426. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hillier LW, Marth GT, Quinlan AR, Dooling D, Fewell G, Barnett D, Fox P, Glasscock JI, Hickenbotham M, Huang W, Magrini VJ, Richt RJ, Sander SN, Stewart DA, Stromberg M, Tsung EF, Wylie T, Schedl T, Wilson RK, Mardis ER: Wholegenome sequencing and variant discovery in C. elegans. [http://www.hubmed.org/fulltext.cgi?uids=18204455] webcite
Nat Methods 2008, 5(2):183188. PubMed Abstract  Publisher Full Text

Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP, Lee W, Mendenhall E, O’Donovan A, Presser A, Russ C, Xie X, Meissner A, Wernig M, Jaenisch R, Nusbaum C, Lander ES, Bernstein BE: Genomewide maps of chromatin state in pluripotent and lineagecommitted cells. [http://www.hubmed.org/fulltext.cgi?uids=17603471] webcite
Nature 2007, 448(7153):553560. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: Highresolution profiling of histone methylations in the human genome. [http://www.hubmed.org/fulltext.cgi?uids=17512414] webcite
Cell 2007, 129(4):823837. PubMed Abstract  Publisher Full Text

Hafner M, Landgraf P, Ludwig J, Rice A, Ojo T, Lin C, Holoch D, Lim C, Tuschl T: Identification of microRNAs and other small regulatory RNAs using cDNA library sequencing. [http://www.hubmed.org/fulltext.cgi?uids=18158127] webcite
Methods 2008, 44:312. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Vera JC, Wheat CW, Fescemyer HW, Frilander MJ, Crawford DL, Hanski I, Marden JH: Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. [http://www.hubmed.org/fulltext.cgi?uids=18266620] webcite
Mol Ecol 2008, 17(7):16361647. PubMed Abstract  Publisher Full Text

Friedländer MR, Chen W, Adamidi C, Maaskola J, Einspanier R, Knespel S, Rajewsky N: Discovering microRNAs from deep sequencing data using miRDeep. [http://www.hubmed.org/fulltext.cgi?uids=18392026] webcite
Nat Biotechnol 2008, 26(4):407415. PubMed Abstract  Publisher Full Text

Chaisson MJ, Pevzner PA: Short read fragment assembly of bacterial genomes. [http://www.hubmed.org/fulltext.cgi?uids=18083777] webcite
Genome Res 2008, 18(2):324330. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Erlich Y, Mitra PP, delaBastide M, McCombie WR, Hannon GJ: AltaCyclic: a selfoptimizing base caller for nextgeneration sequencing. [http://www.hubmed.org/fulltext.cgi?uids=18604217] webcite
Nat Methods 2008, 5(8):679682. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bravo H, Irizarry R: Modelbased quality assessment and basecalling for secondgeneration sequencing data.
Biometrics 2010., 66
To appear

Rougemont J, Amzallag A, Iseli C, Farinelli L, Xenarios I, Naef F: Probabilistic base calling of Solexa sequencing data. [http://www.biomedcentral.com/14712105/9/431] webcite
BMC Bioinformatics 2008, 9:431. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Kao W, Stevens K, Song Y: BayesCall: A modelbased basecalling algorithm for highthroughput shortread sequencing.
Genome Research 2009, 19:18841895. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kircher M, Stenzel U, Kelso J: Improved base calling for the Illumina Genome Analyzer using machine learning strategies.
Genome Biology 2009, 10:R83. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Metzker ML, Raghavachari R, Burgess K, Gibbs RA: Elimination of residual natural nucleotides from 3’OmodifieddNTP syntheses by enzymatic mopup. [http://www.hubmed.org/fulltext.cgi?uids=9821582] webcite
Biotechniques 1998, 25(5):814817. PubMed Abstract

Metzker ML: Emerging technologies in DNA sequencing. [http://www.hubmed.org/fulltext.cgi?uids=16339375] webcite
Genome Res 2005, 15(12):17671776. PubMed Abstract  Publisher Full Text

Newton M, Noueiry A, Sarkar D, Ahlquist P: Detecting differential gene expression with a semiparametric hierarchical mixture method.
Biostatistics 2004, 5:155176. PubMed Abstract  Publisher Full Text

Ji Y, Yin G, Tsui K, Kolonin M, Sun J, Arap W, Pasqualini R, Do KA: Bayesian mixture models for complex highdimension count data in phage display experiments.
Journal of the Royal Statistical Society, Series C (Applied Statistics) 2007, 56(2):139152. Publisher Full Text

Ji Y, Xu Y, Zhang Q, Tsui KW, Yuan Y, Liang S, Liang H: BMMap: Bayesian mapping of multireads for nextgeneration sequencing data. [http://odin.mdacc.tmc.edu/~ylji] webcite
Tech. rep The University of Texas M. D. Anderson Cancer Center; 2010.