Abstract
Error correction of sequenced reads remains a difficult task, especially in singlecell sequencing projects with extremely nonuniform coverage. While existing error correction tools designed for standard (multicell) sequencing data usually come up short in singlecell sequencing projects, algorithms actually used for singlecell error correction have been so far very simplistic.
We introduce several novel algorithms based on Hamming graphs and Bayesian subclustering in our new error correction tool BAYESHAMMER. While BAYESHAMMER was designed for singlecell sequencing, we demonstrate that it also improves on existing error correction tools for multicell sequencing data while working much faster on reallife datasets. We benchmark BAYESHAMMER on both kmer counts and actual assembly results with the SPADES genome assembler.
Background
Singlecell sequencing [1,2] based on the Multiple Displacement Amplification (MDA) technology [1,3] allows one to sequence genomes of important uncultivated bacteria that until recently had been viewed as unamenable to genome sequencing. Existing metagenomic approaches (aimed at genes rather than genomes) are clearly limited for studies of such bacteria despite the fact that they represent the majority of species in such important studies as the Human Microbiome Project [4,5] or discovery of new antibioticsproducing bacteria [6].
Singlecell sequencing datasets have extremely nonuniform coverage that may vary from ones to thousands along a single genome (Figure 1). For many existing error correction tools, most notably QUAKE [7], uniform coverage is a prerequisite: in the case of nonuniform coverage they either do not work or produce poor results.
Error correction tools usually attempt to correct the set of kcharacter substrings of reads called kmers and then propagate corrections to whole reads which are important to have for many assemblers. Error correction tools often employ a simple idea of discarding rare kmers, which obviously does not work in the case of nonuniform coverage.
Medvedev et al. [8] recently presented a new approach to error correction for datasets with nonuniform coverage. Their algorithm HAMMER makes use of the Hamming graph (hence the name) on kmers (vertices of the graph correspond to kmers and edges connect pairs of kmers with Hamming distance not exceeding a certain threshold). HAMMER employs a simple and fast clustering technique based on selecting a central kmer in each connected component of the Hamming graph. Such central kmers are assumed to be errorfree (i.e., they are assumed to actually appear in the genome), while the other kmers from connected components are assumed to be erroneous instances of the corresponding central kmers. However, HAMMER may be overly simplistic: in connected components of large diameter or connected components with several kmers of large multiplicities, it is more reasonable to assume that there are two or more central kmers (rather than one as in HAMMER). Biologically, such connected components may correspond to either (1) repeated regions with similar but not identical genomic sequences (repeats) which would be bundled together by existing error correction tools (including HAMMER); or (2) artificially united kmers from distinct parts of the genome that just happen to be connected by a path in the Hamming graph (characteristic to HAMMER).
In this paper, we introduce the BAYESHAMMER error correction tool that does not rely on uniform coverage. BAYESHAMMER uses the clustering algorithm of HAMMER as a first step and then refines the constructed clusters by further subclustering them with a procedure that takes into account reads quality values (e.g., provided by Illumina sequencing machines) and introduces Bayesian (BIC) penalties for extra subclustering parameters. BAYESHAMMER subclustering aims to capture the complex structure of repeats (possibly of varying coverage) in the genome by separating even very similar kmers that come from different instances of a repeat. BAYESHAMMER also uses a new approach for propagating corrections in kmers to corrections in the reads. All algorithms in BAYESHAMMER are heavily parallelized whenever possible; as a result, BAYESHAMMER gains a significant speedup with more processing cores available. These features make BAYESHAMMER a perfect error correction tool for singlecell sequencing.
We remark that HAMMER produces only a set of central kmers but does not correct reads, making it incompatible with most genome assemblers. QUAKE does correct reads but has severe memory limitations for large k and assumes uniform coverage. In contrast, EULERSR [9] and CAMEL [2] correct reads and do not make strong assumptions on coverage (both tools have been used for singlecell assembly projects [2]) which makes these tools suitable for comparison to BAYESHAMMER. Our benchmarks show that BAYESHAMMER outperforms these tools in both singlecell and standard (multicell) modes. We further couple BAYESHAMMER with a recently developed genome assembler SPADES [10] and demonstrate that assembly of BAYESHAMMERcorrected reads significantly improves upon assembly with reads corrected by other tools for the same datasets, while the total running time also improves significantly.
BAYESHAMMER is freely available for download as part of the SPADES genome assembler at http://bioinf.spbau.ru/spades/ webcite.
Methods
Notation and outline
Let ∑ = {A, C, G, T} be the alphabet of nucleotides (BAYESHAMMER discards kmers with uncertain bases denoted N). A kmer is an element of ∑^{k}, i.e., a string of k nucleotides. We denote the i^{th }letter (nucleotide) of a kmer x by x[i], indexing them from zero: 0 ≤ i ≤ k  1. A subsequence of x corresponding to a set of indices I is denoted by x[I]. We use interval notation [i, j] for intervals of integers {i, i + 1,..., j} and further abbreviate x[i, j] = x [{i, i + 1,..., j}]; thus, x = x[0, k  1]. Input reads are represented as a set of strings R ⊂ Σ* along with their quality values for each r ∈ R. We assume that q_{r}[i] estimates the probability that there has been an error in position i of read r. Notice that in practice, the fastq file format [11] contains characters that encode probabilities on a logarithmic scale (in particular, products of probabilities used below correspond to sums of actual quality values).
Below we give an overview of BAYESHAMMER workflow (Figure 2) and refer to subsequent sections for further details. On Step (1), kmers in the reads are counted, producing a triple statistics(x) = (count_{x}, quality_{x}, error_{x}) for each kmer x. Here, count_{x }is the number of times x appears as a substring in the reads, quality_{x }is its total quality expressed as a probability of sequencing error in x, and error_{x }is a kdimensional vector that contains products of error probabilities (sums of quality values) for individual nucleotides of x across all its occurrences in the reads. On Step (2), we find connected components of the Hamming graph constructed from this set of kmers. On Step (3), the connected components become subject to Bayesian subclustering; as a result, for each kmer we know the center of its subcluster. On Step (4), we filter subcluster centers according to their total quality and form a set of solid kmers which is then iteratively expanded on Step (5) by mapping them back to the reads. Step (6) deals with reads correction by counting the majority vote of solid kmers in each read. In the iterative version, if there has been a substantial amount of changes in the reads, we run the next iteration of error correction; otherwise, output the corrected reads. Below we describe specific algorithms employed in the BAYESHAMMER pipeline.
Figure 2. BAYESHAMMER workflow.
Algorithms
Step (1): computing kmer statistics
To collect kmer statistics, we use a straightforward hash map approach [12] that does not require storing instances of all kmers in memory (as excessive amount of RAM might be needed otherwise). For a certain positive integer N (the number of auxiliary files), we use a hash function h: ∑^{k }→ℤ_{N }that maps kmers over the alphabet Σ to integers from 0 to N  1.
Algorithm 1 Count kmers
for each kmer x from the reads R: do
compute h(x) and write x to File_{h(x)}.
for i ∈ [0, N  1]: do
sort File_{i }with respect to the lexicographic order;
reading File_{i }sequentially, compute statistics(s) for each kmer s from File_{i}.
Step (2): constructing connected components of Hamming graph
Step (2) is the essence of the HAMMER approach [8]. The Hamming distance between kmers x, y ∈ ∑^{k }is the number of nucleotides in which they differ:
For a set of kmers X, the Hamming graph HG_{τ}(X) is an undirected graph with the set of vertices X and edges corresponding to pairs of kmers from X with Hamming distance at most τ, i.e., x, y ∈ X are connected by an edge in HG_{τ}(X) iff d(x, y) ≤ τ (Figure 3). To construct HG_{τ}(X) efficiently, we notice that if two kmers are at Hamming distance at most τ, and we partition the set of indices [0,k  1] into τ + 1 parts, then at least one part corresponds to the same subsequence in both kmers. Below we assume with little loss of generality that τ + 1 divides k, i.e., k = σ (τ + 1) for some integer σ.
Figure 3. Hamming graphs HG_{1}(X) and HG_{2}(X). Hamming graphs HG_{1}(X) and HG_{2}(X) for X being the set of 4mers {ACGTG, CGTGT, GTGTG, ACATG, CATGT, ATGTG, ACCTG, CCTGT, CTGTC} of the reads ACGTGTG, ACATGTG, ACCTGTC. Blue edges denote Hamming distance 2.
For a subset of indices I ⊆ [0, k  1], we define a partial lexicographic ordering ≺_{I }as follows: x ≺_{I }y iff x[I] ≺ y[I], where ≺ is the lexicographic ordering on Σ*. Similarly, we define a partial equality =_{I }such that x =_{I }y iff x[I] = y[I]. We partition the set of indices [0, k  1] into τ + 1 parts of size σ and for each part I, sort a separate copy of X with respect to ≺_{I}. As noticed above, for every two kmers x, y ∈ X with d(x, y) ≤ τ, there exists a part I such that x =_{I }y. It therefore suffices to separately consider blocks of equivalent kmers with respect to =_{I }for each part I. If a block is small (i.e., of size smaller than a certain threshold), we go over the pairs of kmers in this block to find those with Hamming distance at most τ. If a block is large, we recursively apply to it the same procedure with a different partition of the indices. In practice, we use two different partitions of [0, k  1]: the first corresponds to contigious subsets of indices (recall that ):
Algorithm 2 Hamming graph processing
procedure HGPROCESS(X, max_quadratic)
Init components with singletons .
if ϒ > max_quadratic then
else
for s = 0,...,τ do
sort a copy of X with respect to , getting X_{s}.
for s = 0,...,τ do
output the set of equiv. blocks .
for each pair x, y ∈ ϒ do
if d(x, y) ≤ τ then join their sets in :
while the second corresponds to strided subsets of indices:
BAYESHAMMER uses a twostep procedure, first splitting with respect to (Figure 4) and then, if an equivalence block is large, with respect to . On the block processing step, we use the disjoint set data structure [12] to maintain the set of connected components. Step (2) is summarized in Algorithm 2.
Figure 4. Partial lexicographic orderings. Partial lexicographic orderings of a set X of 9mers with respect to the index sets , , and . Red dotted lines indicate equivalence blocks.
Step (3): Bayesian subclustering
In HAMMER's generative model [8], it is assumed that errors in each position of a kmer are independent and occur with the same probability ε, which is a fixed global parameter (HAMMER used ε = 0.01). Thus, the likelihood that a kmer x was generated from a kmer y under HAMMER's model equals
Under this model, the maximum likelihood center of a cluster is simply its consensus string [8].
In BAYESHAMMER, we further elaborate upon HAMMER's model. Instead of a fixed ε, we use reads quality values that approximate probabilities q_{x}[i] of a nucleotide at position i in the kmer x being erroneous. We combine quality values from identical kmers in the reads: for a multiset of kmers X that agree on the j^{th }nucleotide, it is erroneous with probability Π_{x∈X }q_{x}[j].
The likelihood that a kmer x has been generated from another kmer c (under the independent errors assumption) is given by
and the likelihood of a specific subclustering C = C_{1 }∪... ∪ C_{m }is
where c_{i }is the center (consensus string) of the subcluster C_{i}.
In the subclustering procedure (see Algorithm 3), we sequentially subcluster each connected component of the Hamming graph into more and more clusters with the classical kmeans clustering algorithm (denoted mmeans since k has different meaning). For the objective function, we use the likelihood as above penalized for overfitting with the Bayesian information criterion (BIC) [13]. In this case, there are C observations in the dataset, and the total number of parameters is 3 km + m  1:
• m  1 for probabilities of subclusters,
• km for cluster centers, and
• 2 km for error probabilities in each letter: there are 3 possible errors for each letter, and the probabilities should sum up to one. Here error probabilities are conditioned on the fact that an error has occurred (alternatively, we could consider the entire distribution, including the correct letter, and get 3 km parameters for probabilities but then there would be no need to specify cluster centers, so the total number is the same).
Algorithm 3 Bayesian subclustering
for all connected components C of the Hamming graph do
m := 1
ℓ_{1 }:= 2 log L_{1}(C) (likelihood of the cluster generated by the consensus)
repeat
m := m + 1
do mmeans clustering of C = C_{1 }∪...∪ C_{m }w.r.t. the Hamming distance; the initial approximation to the centers is given by kmers that have the least error probability
ℓ_{m }:= 2 · log L_{m}(C_{1},...,C_{m}) (3 km + m  1) · log C
until ℓ_{m }≤ ℓ_{m1}
output the best found clustering C = C_{1 }∪...∪ C_{m1}
Therefore, the resulting objective function is
for subclustering into m clusters; we stop as soon as ℓ_{m }ceases to increase.
Steps (4) and (5): selecting solid kmers and expanding the set of solid kmers
We define the quality of a kmer x as the probability that it is errorfree: . The kmer qualities are computed on Step (1) along with computing kmer statistics. Next, we (generously) define the quality of a cluster C as the probability that at least one kmer in C is correct:
In contrast to HAMMER, we do not distinguish whether the cluster is a singleton (i.e., C = 1); there may be plenty of superfluous clusters with several kmers obtained by chance (actually, it is more likely to obtain a cluster of several kmers by chance than a singleton of the same total multiplicity).
Initially we mark as solid the centers of the clusters whose total quality exceeds a predefined threshold (a global parameter for BAYESHAMMER, set to be rather strict). Then we expand the set of solid kmers iteratively: if a read is completely covered by solid kmers we conclude that it actually comes from the genome and mark all other kmers in this read as solid, too (Algorithm 4).
Step (6): reads correction
After Steps (1)(5), we have constructed the set of solid kmers that are presumably errorfree. To construct corrected reads from the set of solid kmers, for each base of every read, we compute the consensus of all solid kmers and solid centers of clusters of all nonsolid kmers covering this base (Figure 5). This step is formally described as Algorithm 5.
Figure 5. Read correction. Reads correction. Grey kmers indicate nonsolid kmers. Red kmers are the centers of the corresponding clusters (two grey kmers striked through on the right are nonsolid singletons). As a result, one nucleotide is changed.
Algorithm 4 Solid kmers expansion
procedure ITERATIVEEXPANSION(R, X)
while ExpansionStep(R, X) do
function EXPANSIONSTEP(R, X)
for all reads r ∈ R do
if r is completely covered by solid kmers then
mark all kmers in r as solid
Return TRUE if X has increased and FALSE otherwise.
Algorithm 5 Reads correction
Input: reads R, solid kmers X, clusters .
for all reads r ∈ R do
init consensus array υ: [0, r  1] × {A, C, G, T} → ℕ with zeros: υ(j, x[i]):= 0 for all i = 0,...,r  1 and j = 0,...,k  1
for i = 0,...,r  k do
if r[i, i + k  1] ∈ X (it is solid) then
for j ∈ [i, i + k  1] do
υ(j, r[i]):= υ(j, r[i]) + 1
if r[i, i + k  1] ∈ C for some C ∈ then
let x be the center of C
if x ∈ X (r belongs to a cluster with solid center) then
for j ∈ [i, i + k  1] do
υ(j, x[i]):= υ(j, x[i]) + 1
for i ∈ [0, r  1] do
r[i]:= arg max_{a∈Σ }υ(i, a).
Results and discussion
Datasets
In our experiments, we used three datasets from [2]: a singlecell E. coli, a singlecell S. aureus, and a standard (multicell) E. coli dataset. Pairedend libraries were generated by an Illumina Genome Analyzer IIx from MDAamplified singlecell DNA and from multicell genomic DNA prepared from cultured E. coli, respectively These datasets consist of 100 bp pairedend reads with insert size 220; both E. coli datasets have average coverage ≈ 600×, although the coverage is highly nonuniform in the singlecell case.
In all experiments, BAYESHAMMER used k = 21 (we observed no improvements for higher values of k).
kmer counts
Table 1 shows error correction statistics produced by di erent tools on all three datasets. For a comparison with HAMMER, we have emulated HAMMER with read correction by turning off Bayesian subclustering (HammerExpanded in the table) and both Bayesian subclustering and read expansion, another new idea of BAYESHAMMER (HammerNoExpansion in the table). Note that despite its more complex processing, BAYESHAMMER is significantly faster than other error correction tools (except, of course, for HAMMER which is a strict subset of BAYESHAMMER processing in our experiments and is run on BAYESHAMMER code). BAYESHAMMER also produces, in the singlecell case, a much smaller set of kmers in the resulting reads which leads to smaller de Bruijn graphs and thus reduces the total assembly running time. Since BAYESHAMMER trims only bad quality bases and does not, like QUAKE, trim bases that it has not been able to correct (it has been proven detrimental for singlecell assembly in our experiments), it does produce a much larger set of kmers than Quake on a multicell dataset.
Table 1. kmer statistics.
For a comparison of BAYESHAMMER with other tools in terms of error rate reduction across an average read, see the logarithmic error rate graphs on Figure 6. Note that we are able to count errors only for the reads that actually aligned to the genome, so the graphs are biased in this way. Note how the first 21 bases are corrected better than others in BAYESHAMMER and both versions of HAMMER since we have run it with k = 21; still, other values of k did not show a significant improvement in either kmer statistics or, more importantly, assembly results.
Figure 6. Error reduction. Error reduction by read position on logarithmic scale for the singlecell E. coli, singlecell S. aureus, and multicell E. coli datasets.
Assembly results
Tables 2 and 3 shows assembly results by the recently developed SPAdes assembler [10]; SPAdes was designed specifically for singlecell assembly, but has by now demonstrated stateoftheart results on multicell datasets as well.
Table 2. Assembly results, singlecell E.coli and S. aureus datasets (contigs of length ≥ 200 are used).
Table 3. Assembly results, multicell E.coli dataset (contigs of length ≥ 200 are used).
In the tables, N50 is such length that contigs of that length or longer comprise of the assembly; NG50 is a metric similar to N50 but only taking into account contigs comprising (and aligning to) the reference genome; NA50 is a metric similar to N50 after breaking up misassembled contigs by their misassemblies. NGx and NAx metrics have a more direct relevance to assembly quality than regular Nx metrics; our result tables have been produced by the recently developed tool QUAST [14].
All assemblies have been done with SPADES. The results show that after BAYESHAMMER correction, assembly results improve significantly, especially in the singlecell E. coli case; it is especially interesting to note that even in the multicell case, where BAYESHAMMER loses to QUAKE by kmer statistics, assembly results actually improve over assemblies produced from QUAKEcorrected reads (including genome coverage and the number of genes).
Conclusions
Singlecell sequencing presents novel challenges to error correction tools. In contrast to multicell datasets, for singlecell datasets, there is no pretty distribution of kmer multiplicities; one therefore has to work with kmers on a onebyone basis, considering each cluster of kmers separately. In this work, we further developed the ideas of HAMMER from a Bayesian clustering perspective and presented a new tool BAYESHAMMER that makes them practical and yields significant improvements over existing error correction tools.
There is further work to be done to make our underlying models closer to real life; for instance, one could learn a nonuniform distribution of single nucleotide errors and plug it in our likelihood formulas. Another natural improvement would be to try and rid the results of contamination by either human or some other DNA material; we observed significant human DNA contamination in our singlecell dataset, so weeding it out might yield a significant improvement. Finally, a new general approach that we are going to try in our further work deals with the technique of minimizers introduced by Roberts et al. [15]. It may provide significant reduction in memory requirements and a possible approach to dealing with paired information.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
All authors contributed extensively to the work presented in this paper.
Declarations
The publication costs for this article were funded by the Government of the Russian Federation, grant 11.G34.31.0018.
This article has been published as part of BMC Genomics Volume 14 Supplement 1, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S1. webcite
Acknowledgements
We thank Pavel Pevzner for many fruitful discussions on all stages of the project. We are also grateful to Andrei Prjibelski and Alexei Gurevich for help with the experiments and to the anonymous referees whose comments have benefited the paper greatly. This work was supported the Government of the Russian Federation, grant 11.G34.31.0018. Work of the first author was also supported by the Russian Fund for Basic Research grant 120100450a and the Russian Presidential Grant MK6628.2012.1. Work of the second author was additionally supported by the Russian Fund for Basic Research grant 120100747a.
References

Grindberg R, Ishoey T, Brinza D, Esquenazi E, Coates R, Liu W, Gerwick L, Dorrestein P, Pevzner P, Lasken R, Gerwick W: Single cell genome amplification accelerates identification of the apratoxin biosynthetic pathway from a complex microbial assemblage.
PLOS One 2011, 6(4):e18565. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chitsaz H, YeeGreenbaum JL, Tesler G, Lombardo MJ, Dupont CL, Badger JH, Novotny M, Rusch DB, Fraser LJ, Gormley NA, SchulzTrieglaff O, Smith GP, Evers DJ, Pevzner PA, Lasken RS: Efficient de novo assembly of singlecell bacterial genomes from shortread data sets.
Nat Biotechnol 2011, 29:915921. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ishoey T, Woyke T, Stepanauskas R, Novotny M, Lasken R: Genomic sequencing of single microbial cells from environmental samples.
Current Opinion in Microbiology 2008, 11(3):198204. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gill S, Pop M, Deboy R, Eckburg P, Turnbaugh P, Samuel B, Gordon J, Relman D, FraserLiggett C, Nelson K: Metagenomic analysis of the human distal gut microbiome.
Science 2006, 312(5778):13551359. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hamady M, Knight R: Microbial community profiling for human microbiome projects: tools, techniques, and challenges.
Genome Res 2009, 19(7):11411152. PubMed Abstract  Publisher Full Text

Li J, Vederas J: Drug discovery and natural products: end of an era or an endless frontier?
Science 2009, 325(5937):161165. PubMed Abstract  Publisher Full Text

Kelley DR, Schatz MC, Salzberg SL: Quake: qualityaware detection and correction of sequencing errors.
Genome Biology 2010, 11(11):R116. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Medvedev P, Scott E, Kakaradov B, Pevzner P: Error correction of highthroughput sequencing datasets with nonuniform coverage.
Bioinformatics 2011, 27(13):i13741. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chaisson MJ, Pevzner P: Short read fragment assembly of bacterial genomes.
Genome Research 2008, 18:324330. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bankevich A, Nurk S, Antipov D, Gurevich A, Dvorkin M, Kulikov A, Lesin V, Nikolenko S, Pham S, Prjibelski A, Pyshkin A, Sirotkin A, Vyahhi N, Tesler G, Alekseyev M, Pevzner P: SPAdes: a new genome assembler and its applications to single cell sequencing.
Journal of Computational Biology 2012, 19(5):455477. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cock P, Fields C, Goto N, Heuer M, Rice P: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants.
Nucleic Acids Res 2010, 38(6):17671771. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cormen TH, Leiserson CE, Rivest R: Introduction to Algorithms. MIT Press; 2009.

Schwarz G: Estimating the dimension of a model.
Annals of Statistics 1978, 6:461464. Publisher Full Text

Gurevich A, Saveliev V, Vyahhi N, Tesler G: QUAST: Quality Assessment for Genome Assemblies.
2012.
[Submitted]

Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA: Reducing storage requirements for biological sequence comparison.
Bioinformatics 2004, 20(18):33633369. PubMed Abstract  Publisher Full Text