Abstract
Background
PCR amplification and highthroughput sequencing theoretically enable the characterization of the finestscale diversity in natural microbial and viral populations, but each of these methods introduces random errors that are difficult to distinguish from genuine biological diversity. Several approaches have been proposed to denoise these data but lack either speed or accuracy.
Results
We introduce a new denoising algorithm that we call DADA (Divisive Amplicon Denoising Algorithm). Without training data, DADA infers both the sample genotypes and error parameters that produced a metagenome data set. We demonstrate performance on control data sequenced on Roche’s 454 platform, and compare the results to the most accurate denoising software currently available, AmpliconNoise.
Conclusions
DADA is more accurate and over an order of magnitude faster than AmpliconNoise. It eliminates the need for training data to establish error parameters, fully utilizes sequenceabundance information, and enables inclusion of contextdependent PCR error rates. It should be readily extensible to other sequencing platforms such as Illumina.
Background
The potential of highthroughput sequencing as a tool for exploring biological diversity is great, but so too are the challenges that arise in its analysis. These technologies have made possible the characterization of very rare genotypes in heterogeneous populations of DNA at low cost. But when applied to a metagenomic sample, the resulting raw data consist of an unknown mixture of genotypes that are convolved with errors introduced during amplification and sequencing.
There are two broad approaches to highthroughput sequencing of metagenomes: in amplicon sequencing (also called genecentric or genetargeted metagenomics) a pool of DNA for sequencing is produced by using PCR to amplify all the variant sequences in a sample that begin and end with a chosen pair of primers [13], frequently targeting hypervariable regions of the 16S ribosomal RNA gene [4]; in de novo genome assembly total DNA is sequenced without amplification and reads are clustered into “species bins”, each providing the material for genome assembly by shotgun methods (see Table 1 in [5] for a list of such studies).
By trading off a broad survey of gene content for greater sequencing depth at the sampled loci, amplicon sequencing has the potential to detect the rarest members of the sampled community, but errors interfere more profoundly. Unlike genome assembly projects, where one needs only to determine the consensus base at each locus or decide whether a SNP is present in a population, the space of possible distributions for the sample genotypes and frequencies is effectively infinite. As a result, ambiguities in genome projects can usually be resolved by increasing the amount of data, whereas increasing depth (as much as 10^{6} in recent studies [6,7]) increases the number of both real and errorcontaining sequences and makes the challenge of distinguishing minority variants from errors only greater under amplicon sequencing. Greater depth therefore calls for progressively more sophisticated methods of analysis.
The analysis of amplicon sequence data typically begins with the construction of OTUs (operational taxonomic units), clusters of sequences that are within a cutoff in Hamming distance from one another. OTUs serve to collapse the complete set of sequences into a smaller collection of representative sequences – one for each OTU – and corresponding abundances based on the number of reads falling within each cluster. OTUs were developed as a tool for classifying microbial species, but have also been repurposed to the task of correcting errors; the sequences within an OTU are typically interpreted as a taxonomic grouping without specifying whether the variation within an OTU represents errors or real diversity on a finer scale than that chosen to define the OTU. If the scale of the noise is smaller than that of the clusters, then the construction of OTUs will appropriately group errorcontaining sequences together with their true genotype. However, as sequencing depth increases, low probability errors outside the OTU radius will start to appear, and will be incorrectly assigned to their own OTU. Early studies using this approach on highthroughput metagenome data sets reported large numbers of lowabundance, previously unobserved genotypes that were collectively dubbed the rare biosphere[8]. Later, analyses of control data sets indicated that the diversity estimates in such studies tends to be highly inflated [9] and that results may lack reproducibility [10]. The dual purpose of OTUs for correcting errors and for taxonomic grouping is appropriate when the diversity is being sampled at a coarse level, e.g. the frequency of different phlya. However, when probing finerscale diversity, OTU methods have intrinsically high false positive and false negative rates: they both overestimate diversity when there exist errors larger than the OTUdefining cutoff and cannot resolve real diversity at a scale finer than that (arbitrary) cutoff.
In response, a variety of approaches to disentangling errors from actual genetic variation have been proposed recently [1114]. These include multiple rounds of OTU clustering with different hierarchical methods [11], utilizing sequence abundance information implicitly by starting new clusters with common sequences [11,12], and replacing OTU clustering with an ExpectationMaximization (EM) approach [13,14]. Accuracy has steadily improved, but all methods still fall short of maximizing the information acquired from metagenome data sets.
We believe that the way forward is to model the error process and evaluate the validity of individual sequences in the context of the full metagenomic data set, crucially including the abundances (number of reads) corresponding to each sequence. Major progress in this direction has been made recently by Quince et al[13,14]. In the specific context of pyrosequencing, often used for metagenomics, strings of the same nucleotide (homopolymers) are problematic, and Quince et al incorporated a model of the distribution of homopolymer light intensities into an ExpectationMaximization (EM) algorithm, Pyronoise, which infers the homopolymer lengths of sequencing reads [13] (Q09). Later, Quince et al released AmpliconNoise, an extension of PyroNoise that includes rates of singlenucleotide substitution errors obtained from training data [14] (Q11). These methods were shown to more accurately infer the underlying sample genotypes than other approaches, demonstrating the worth of explicitly modeling errors. However, the methods of Quince et al. have several shortcomings that we would like to rectify: (i) as the size of sequence data sets grows, AmpliconNoise becomes too slow to use in many applications; (ii) estimation of error rates relies on the existence of training data specific to the PCR and sequencing chemistries used; (iii) differentiation of finescale diversity is limited because read abundances are not fully utilized when calculating the distance between sequences and clusters; (iv) the parameters that determine how conservative the algorithm is in inferring real diversity are ad hoc and cannot be tuned without experimentspecific training data.
We build on the errormodeling approach pioneered in AmpliconNoise by developing a novel algorithm, DADA, to denoise metagenomic amplicon sequence data that addresses the concerns raised above [15]. We start with a parametric statistical model of substitution errors. We incorporate this error model into a divisive hierarchical clustering algorithm that groups errorcontaining reads into clusters consistent with being derived from a single sample genotype. Finally we couple this clustering algorithm with the inference of the error parameters from the clustered data, and perform each step in alternation until both converge. This method is presented below, and is shown to outperform previous methods in both speed and accuracy on several control data sets.
Results
Model and algorithm
We introduce a firstorder model of the error process by assuming (1) each sequence read originates from a distinct DNA molecule in the sample, and therefore that the presence of errors on different reads are statistically independent, and (2) errors on different sites of the same read are also statistically independent events. The independence of errors across different reads relies on the independence of the PCR replication histories of those reads, a condition that holds when the total number of reads is significantly smaller than the total number of DNA molecules present in the initial environmental sample and there are no strong amplification biases for sequences with errors.
Under these conditions, the numbers of reads (abundances) of the errorcontaining sequences derived from a sample genotype follow the multinomial distribution, and the abundance r of each particular sequence is binomially distributed (see Methods) with a probability λdetermined by the particular combination of errors in that sequence and a number of trials ρ given by the total number of reads of its sample genotype. These facts allow us to establish two statistics to evaluate the hypothesis that a collection of sequencing reads derives from a single sample genotype. The abundance pvalue determines when there are too many reads of the same sequence to be consistent with the error model, and the read pvalue determines when a sequencing read is too far away to be an error from an inferred sample genotype.
These statistics serve as the basis of a sequenceclustering algorithm in which (1) reads are assigned to clusters, (2) a putative sample genotype is inferred for each cluster, (3) reads are reassigned to the cluster for which they are most likely to have resulted as errors from the inferred sample genotype, (4) the two pvalue statistics are computed given the inferred sample genotypes and the clustering of the sequences (5) additional clusters are created if the clustering is statistically inconsistent with the error model (as suggested by small pvalues).
The full algorithm (Figure 1) combines this probabilistic sequence clustering with the estimation of substitutionerror probabilities that are used to compute the pvalues. The algorithm begins by assuming all reads derive from a single sample genotype and estimates initial error probabilities given this assumption. It then alternates between clustering the reads and reestimating the error probabilities until it converges to a final set of mutually consistent clusters and error probabilities.
Figure 1. DADA schematic. The basic structure of DADA, an algorithm to denoise amplicon sequence data. See Algorithm Algorithm 1 in the Methods section for the pseudocode and a more detailed description.
The pvalues
We introduce two statistics for deciding that particular sequences did not arise from errors. The read pvalue is the probability of having observed at least one read of a sequence that is as improbable as the most improbable sequence amongst the observed reads. This statistic treats each read as a separate event (giving rise to its name) and therefore does not utilize sequence abundance. It results in a hard cutoff, λ^{∗}, below which reads are decided not to be errors by DADA. This cutoff is set by the choice of a significance threshold Ω_{r}, the probability of having observed at least one read more unlikely than λ^{∗}. The abundance pvalue, which is computed for each sequence individually, is the probability of having observed at least as many identical reads as we did of each sequence (conditioned on having observed at least one). The conservativeness of this measure is set by a significance threshold Ω_{a}, the probability that at least one sequence should have been as overabundant as the most overabundant sequence. The abundance pvalue gives DADA significantly greater sensitivity than previous methods.
Figure 2 shows simulated and real data from a typical cluster of sequences that originated from a common genotype (from the Artificial data set, introduced below). The abundance r and probability λ of each sequence is plotted, as the ability of the read and abundance pvalues to discriminate between errors and nonerrors is easily visualized in this parameter space. The regions where DADA will declare a sequence to be an error or a real sequence are delineated by a dashed line for each of our pvalue statistics. The λ values have been logtransformed and scaled by the most common error probability, making the xaxis interpretable as an effective Hamming distance. Due to this scaling, it is also useful to interpret this plot in terms of real Hamming distances, in which case the Ω_{a}line represents a lower bound on DADA’s resolution for any error at that distance.
Figure 2. Discrimination plots for a typical cluster in the Artificial data set with 4691 reads. (a) simulated errors drawn from the error model and (b) the real errors in the cluster. Sequences (diamonds) are characterized by abundance and the probability λper read of having been produced. On the xaxis, we plot logλscaled by the most common error probability, T_{A→G}, so that values can be interpreted as an effective Hamming distance. The dashed lines delineate the region – the lower left quadrant – where, for significance thresholds Ω_{a}and Ω_{r} provided by the user, DADA accepts that a sequence could have arisen via the error model. The vertical dashed lines shows the λbelow which (or the effective distance above which) the read pvalue rejects sequences as being errors, and the curved dashed line shows the abundances above which the abundance pvalue rejects sequences as being errors for each value of λ. There are several sequences in the real data (red diamonds) that would be rejected by the abundance pvalue at the Ω_{a} = .01 significance level; we posit that early round PCR effects are a suitable candidate to explain these departures from the error model.
For both the real and simulated data, the abundance pvalue does a good job of tracking the form of the abundances of the errors, and the read pvalue sits to the right of all observed data. For the real data, a small number of errors sit on or above the abundance discrimination line. Such errors were individually not expected to be observed at all, but ended up with a small number of reads larger than one. This pattern was observed across many clusters, and we believe that it reveals the presence of small violations of our assumption of the independence between reads. In particular, in a regime where the ratio of the number of errorfree reads to the number of DNA molecules in the sample that act as the basis for amplification is of order one or larger, then errors during early stages of PCR may be sampled multiple times in the sequence data. As a result, the distribution for the number of reads of these errors may fall off much more slowly than what our model suggests. To deal with this effect in this paper, we lowered the Ω_{a}threshold using an ad hoc method (discussed below) to prevent excess false positives. Doing so did not affect DADA’s ability to detect the genuine diversity in the data analyzed in this paper, which was typical of the data analyzed in many microbial metagenomics studies, but the sensitivity that is lost by using very small values of Ω_{a} could be limiting for samples with even finerscale diversity. Further analytics that model PCR as a branching process improve this current ad hoc threshold (unpublished work).
Treatment of insertions and deletions
DADA does not attempt to explicitly model the indel error processes, and indels do not contribute to the determination of whether sequences are related to each other via errors. Instead, sequences are aligned to each putative sample genotype, and are assigned to clusters on the basis of substitutions. During the computation of pvalues, we sum together the reads of sequences within each cluster that have the same set of substitutions (forming structures that we call indel families). The number of reads of each of these indel families, rather than those of the raw sequences, are the basis of our pvalues (see Methods).
Treating indels in this way does not affect the accuracy of DADA for the test data sets analyzed here, as the sample genotypes all differed from each other by at least one substitution, and these provided enough information for DADA to distinguish between them. However, DADA cannot distinguish between sequences that differ only by indels. In such cases, if the amplicons being denoised are coding regions, frame information should be used for making decisions about whether particular indels are real or errors, but in order to denoise noncoding regions with pure indel diversity, DADA is not sufficient in its current form.
Preclustering
Prior to our probabilistic sequence clustering we divided the raw data into coarse 3% singlelinkage clusters (with indels not contributing to distance), subsets for which each sequence is ≤3% from at least one other sequence in its cluster and >3%from all sequences in other clusters. Due to its speed, we employed the ESPRIT algorithm for this task [16]. Singlelinkage’s propensity for chaining was advantageous in this circumstance, as all errorcontaining sequences are very likely to be in the same cluster as their originating sample genotypes; for a sample genotype and one of its errors to end up in different clusters, the error would have to be ≥3% from the nearest error clustered with that genotype, corresponding to a large gap in the error cloud, which is unlikely under our error model.
Clustering
Each precluster is partitioned into sets of sequences that are conjectured to contain all errors arising from different sample genotypes. This partition is initialized to a single cluster containing all sequences. Two procedures then alternate. First, the indel family most unlikely to have resulted from errors is split off into a new cluster. Sequences then move between clusters based on the probability that they were generated as errors by each one, and the consensus sequence for each cluster is updated until there are no remaining reassignments that can improve the probability of the data. This second step is analogous to the assignment and update steps of standard kmeans clustering. This alternation stops when the partition of the sequences fits with the current error model at the significance levels provided by the user.
Accuracy
We evaluated the accuracy of DADA by denoising three of the data sets in Q11 used to demonstrate AmpliconNoise’s accuracy relative to the earlier SLP and DeNoiser algorithms. These data are derived from mixtures of known clones that were amplified together and sequenced on the 454 platform, and consisted of different hypervariable regions of the 16S RNA subunit of bacterial ribosomes (16S rRNA), which are commonly used as a proxy for phylogenetic diversity in metagenomic studies [4]. Two of the data sets, Divergent and Artificial, with 35,190 and 31,867 reads, were sequenced with the GSFLX chemistry and were truncated at 220 nucleotides. They were constructed by amplifying the V5 region of the 16S rRNA gene from 23 and 90 clones, respectively, isolated from lake water. The Divergent clones were mixed in equal proportions and are separated from each other by a minimum nucleotide divergence of 7%, while the Artificial clones were mixed in abundances that span several orders of magnitude, with some of the clones differing by a single SNP. The other data set, Titanium, with 25,438 reads, was sequenced with the newer Titanium chemistry and was truncated at 400 nucleotides. It contains V45 16S rRNA genes from 89 clones isolated from Arctic soil with varying abundance and genetic distances, similar to the Artificial set.
All data sets had undergone filtering of reads deemed to be of low quality prior to application of AmpliconNoise in Q11, so for purposes of comparison, we denoised the same set of filtered reads. The presence of a small number of lowquality reads in 454 data has been previously demonstrated [17], and as we do not expect these to be well described by our error model, we encourage the use of such quality filtering before applying DADA to nontest data. As SLP and DeNoiser were already demonstrated to be less accurate than AmpliconNoise on these data, we include here DADA’s performance only relative to that of AmpliconNoise. There were six other data sets presented in Q11 of V2 regions from a gut microbial community, but these had such an overwhelming number of chimeric sequences (reported to be as high as 85% in Q11), which neither DADA nor AmpliconNoise attempts to address, that we opted not to include these data sets in our analysis.
Tuning algorithmic sensitivity
DADA employs two tunable parameters that determine how conservative or liberal the algorithm is to be in deciding whether particular sequences could have resulted from errors: Ω_{a}, and Ω_{r}, the significance levels for its abundance and read pvalues. Decisions about singletons, the sequences represented by a single read, depend on Ω_{r}, whereas decisions about sequences with several reads depend on Ω_{a}. The two values may be tuned independently to match the priority being placed on capturing the rarest and more common diversity.
Due to earlystage PCR effects discussed above, it was necessary to use Ω_{a} significance levels lower than typical values. In order to select such values, we first performed a loose clustering of each data set with larger values of Ω_{a} and Ω_{r} and then made histograms of the Ω_{a}thresholds that would be required for each cluster to be reabsorbed into some other cluster (Figure 3). If there are errors with moderate statistical deviations from our model, we expect that these will show up as a tail of increasingly small pvalues that will disappear smoothly as we lower the Ω_{a} threshold. Thus, we looked for the first large gap in these histograms that would suggest all such model departures had been captured. Such a gap occurs at Ω_{a}=10^{−15} for the Divergent data, Ω_{a}=10^{−40} for the Artificial data, and Ω_{a}=10^{−100}for the Titanium data. We used these values in the analysis that follows, but also clustered all three data sets with Ω_{a}=10^{−100}and found that the results were unchanged (see appendix Appendix 2: Ω_{a} robustness). This suggests that Ω_{a}=10^{−100}is a reasonable default value to use when clustering diversity at this scale, even though higher resolution may be achieved by the method outlined above. For nontest metagenome data that is more diverse and less oversampled, we have seen evidence that using much larger values of Ω_{a}(such as .01) may be possible without compromised accuracy, but in such cases it is always advisable to make histograms of the type above to ensure that there is not an excess of clusters that would vanish if Ω_{a} were lowered slightly.
Figure 3. Ad hoc Ω_{a }choices for the Divergent (a) and (d), Artificial (b) and (e), and Titanium (c) and (f) data sets. (a)(c) are histograms of the Ω_{a}threshold at which each cluster derived from a run of DADA with Ω_{a}=Ω_{r}=10^{−3}rejoins some other nearby cluster. Genuine genotype counts are shown in blue and false positive counts are shown in red. The first gaps in these histograms were used to pick Ω_{a}thresholds for reclustering the data, and are indicated by vertical dashed lines. (d)(f) show the Ω_{a}discrimination lines for the largest cluster in each data set (with 2294, 5479, and 1095 reads) for Ω_{a}=10^{−3}and the associated ad hoc Ω_{a}values.
We did not observe any significant departures in these data from our model that would affect the read pvalues, and it was therefore possible to maintain the interpretation of Ω_{r}as a significance threshold. As a result, for these data, which contain ≤50 preclusters that were clustered separately by DADA, we set Ω_{r}=10^{−3} so that the probability of having a false positive would be ≤5%for each data set.
False negatives and false positives
The purpose of DADA (and AmpliconNoise) is the inference of the genotypes present in the underlying sample from a set of noisy (errorcontaining) sequencing reads. There are two types of errors that such an algorithm can make: false positives in which a sample genotype is inferred that was not present in the sample, and false negatives in which the algorithm fails to infer a sample genotype that was present in the sequencing reads. The tradeoff between false positives and false negatives in the two algorithms can be controlled by the algorithmic parameters, depending on which type of error presents more of a problem to the user.
We present, in Table 1, a comparison of the false positives and false negatives for DADA and AmpliconNoise applied to the control data sets described above. Note, however, one important detail: these algorithms are designed to remediate substitution and indel errors, not all possible errors. In particular, we found that contaminants, chimeras, and pathological homopolymer errors contributed to these sequencing data sets. Using ad hoc methods, discussed in appendix Appendix 1: chimeras, contaminants, and missing or incorrect Sanger sequences, we accounted for these additional error sources, and did not penalize either algorithm for them.
Table 1. False positives and false negatives
DADA is more accurate in its inference of the sample genotypes than is AmpliconNoise on every data set. The difference is especially strong among false negatives, where DADA successfully identifies virtually all sample genotypes; DADA’s only two false negatives, both in the Artificial set, result from pathological alignment issues between sequences that differ only in the last two bases.
The differences in the nature of the false positives and negatives made by DADA and AmpliconNoise are shown in Figure 4. DADA produces one false positive in the Artificial data: a sequence with 268 reads that is a single substitution from a sample genotype with only 210 reads. Due to its vast abundance, this is unlikely to be an error, and we speculate it may represent a polymorphism that arose early in the growth of this clone. AmpliconNoise produces eight false positives in the Artificial data: all sequences with 14 reads that are (except for one) 14 substitutions away from clusters with a few hundreds of reads. Although these sequences were not atypical errors as judged by DADA due to their low abundances, AmpliconNoise calls them real as a result of setting a narrow error radius that is needed to prevent additional false negatives. The differences between the errors made by the two algorithms is less clear in the Titanium data set, but DADA outperforms AmpliconNoise in both FPs and FNs. We included AmpliconNoise’s results for the Titanium data set for both parameter settings included in Q11: the σ=.04 clustering (s25), which produces only four false negatives, leads to many false positives similar in nature to those of the Artificial clustering – low abundances and a small number of substitutions away from large clusters; the σ=.1 (s10) clustering produces many fewer false positives but misses nine sample genotypes.
Figure 4. Nature of false positives and false negatives of DADA and AmpliconNoise on Artificial and Titanium data sets. False positives are characterized by the number of reads associated with the falsely inferred genotype r, the distance to the nearest real sequence d, and the number of reads associated with that nearest real sequence R. False negatives are characterized by the number of reads that matched the missing genotype r, the distance from that missing genotype to the nearest inferred genotype d, and the number of reads associated with that nearest inferred genotype R.
Speed
We evaluated the speed of DADA applied to the Artificial data, which was used to profile AmpliconNoise in Q11 (Table 2). ESPRIT was run on a single core of an AMD Phenom II 3.2GHz running Ubuntu and DADA was run on a MacBook Pro with an Intel Core 2 Duo 2.4 GHz.
Table 2. CPU times for clustering artificial community
DADA is currently written in MATLAB, but sequence alignments and the construction of indel families were bottlenecks that we reimplemented as MEX (Matlab executable) C programs. The majority of the time to run our denoising pipeline on the Artificial data set is spent on ESPRIT’s performance of pairwise alignments during the singlelinkage preclustering step (needledist). A newer version of ESPRIT promises to be released soon that may dramatically lower this time [18]. If additional speedups are needed as data sets grow, it should be possible to replace the global alignments of ESPRIT by banded alignments that would be guaranteed to produce the same clusters if the width of the band is equal to the cluster radius, and would have have roughly linear (in sequence length) rather than quadratic, running time. Nonetheless, for these data DADA already gives a 60fold speedup over AmpliconNoise, making jobs that required a 64core cluster to run AmpliconNoise appropriate for a laptop running DADA.
As read lengths continue to grow, we expect the time complexity of DADA to be affected in two primary ways. First, because the complexity of the NeedlemanWunsch
alignment algorithm used by both DADA and ESPRIT scales with the product of the lengths of the input sequences [19] there will be a quadratic slowdown with increasing read length unless heuristics
are employed. On data sets comparable to those analyzed here, alignments consume the
majority of algorithmic time and this scaling will dominate in the near future. Second,
our current implementation for computing read pvalues has both time and space complexity
that grows very rapidly with read length (it is asymptotically
PCR substitution probabilities: symmetries and nearestneighbor contextdependence
DADA not only infers sample genotypes, it also infers the substitution error probabilities caused by the amplification and sequencing processes. The substitution error probabilities inferred by DADA for all three data sets exhibit an approximate symmetry under complementation of the two bases involved. For example, the A→G probability is close to the T→C probability. This symmetry is expected if substitution errors predominantly arise during PCR amplification because substitution errors during PCR can be the result of either of two different mispairing events (from when the sequence is being copied to the opposite strand, or from when it is being copied back), and complementary substitution errors share causal mispairing events (see Figure 5 for a schematic). As it was not imposed, and the identities of the original genotypes were not known to the algorithm, the symmetry is a highly nontrivial check on DADA’s ability to learn error probabilities without training data. Additionally, the inferred substitution probabilities were similar across the data sets, and especially so between Divergent and Artificial, which were generated with the same PCR protocols.
Figure 5. Two paths to the same error. Different mispaired bases (red) produce the same double stranded product once paired with complementary bases (green) so that each path leads to an ATG→AGG substitution error on one strand and a CAT→CCT on the other. The probability of these two errors is therefore expected to be very similar.
We also found that the nearestneighbor nucleotide context affects the probability
of substitution errors. We therefore introduced contextdependent substitution probabilities into DADA, allowing for dependence on the nucleotides immediately preceding and following the
substituted nucleotide. Such probabilities are expected to exist in reverse complementary
pairs for the same reason given for the contextindependent case (again, see Figure 5); the lir→ljr probability is expected to be similar to the
Figure 6. Error probability symmetries for Divergent (a) and (d), Artificial (b) and (e), and Titanium (c) and (f) data sets. (a)(c): contextindependent substitution error probabilities inferred by DADA with 95% confidence intervals based on binomial sampling error. Note the approximate
symmetry between i→jand
The magnitude of contextdependence for these data was moderate (most contextdependent probabilities differed by <50% from contextindependent ones) as seen in the spread of points along the diagonal in Figure 6d,e,f . As a result, maintaining separate probabilities for different contexts did not affect the inferred sample genotypes. Nonetheless, that DADA was robust to significant variation in its parameters is a strong check on the stability of its sample inference.
We have worked with data for which contextdependence is large and has a strong effect on clustering. Therefore, we leave use of contextdependence as an optional feature of DADA, either as a consistency check, or when justified by the amount and nature of the data. But a caution is in order: with modest sized data sets, or if the sequences are too similar, use of the contextdependent rates could result in overfitting and calling too many errors. However if this did occur, the expected complementarity symmetry of the inferred error probabilities would be unlikely to obtain unless the sequences were read in both directions.
Discussion
DADA explicitly incorporates read abundance when deciding whether sequences are genuine or errors; if there are many identical reads of a sequence, DADA will be more likely to infer an underlying sample genotype, even if individually those reads would be consistent with being an error from a nearby genotype. Furthermore, DADA implicitly assumes, via the error model, that reads near highly abundant sequences are far more likely to be errors. In contrast, previous methods have typically treated each read independently. AmpliconNoise partially incorporates abundance by weighting the prior probability that a read belongs to a cluster by the frequency of that cluster, but this is weaker than DADA, where dependence on cluster size shows up in a binomial coefficient (see Methods), especially for highabundance errors. By using both sequence identity and abundance in this way, DADA is able to disentangle real diversity from errors at finer scales than previous methods, even when tuned to be very conservative.
However, full incorporation of abundance information makes DADA sensitive to earlystage PCR effects and the misestimation of error probabilities. The problem of earlystage effects is particularly pronounced in these data: when clustered with Ω_{a}=Ω_{r}=10^{−3}, the Artificial data produces 68 false positives (we would have expected no false positives if the model assumptions were not violated). The majority of these sequences have 25 reads and 24 errors. Such problems would be typical of moderately oversampled PCR, the regime in which initial sample molecules are typically sampled multiple times, allowing a single error during early stages to show up in more than one read.
In lieu of an abundance statistic that appropriately compensates for this affect, we deal with this problem by lowering the sensitivity of the algorithm by tuning down Ω_{a}. Further, because the probability given to each sequence scales as the error probabilities to the power of the number of reads (see Methods), if certain error parameters are larger than estimated in certain contexts, then the statistical significance of an error with many reads can be substantially overestimated. This problem gets progressively worse for deeper data sets, as all oneaway errors begin to take on many reads. In anticipation of this problem, we have introduced nearestneighbor contextdependence of error rates (see Methods). These had no impact on the final clustering for the test data presented, but in other data sets with larger contextdependent effects, we found a reduction in diversity estimates when contextdependence was included (data not shown).
DADA is a divisive hierarchical clustering algorithm: all sequences are assigned to a single cluster that is subdivided until the clustering fits an error model. Previous methods, including AmpliconNoise and simple OTUclustering, have predominantly taken the opposite, agglomerative approach, which starts with too many clusters and merges them until some condition is met. This gives DADA a practical advantage, as the computational and space requirements (especially the number of alignments to perform and store) scale with the square of the initial number of clusters [20]. The typical problem of divisive methods – that the number of possible splittings is too large – is handled in DADA by seeding new clusters with sequences that are identified as not being errors and allowing other sequences, e.g. errors associated with the new clusters, to relocate if they become more probable by doing so.
Finally, DADA uses unsupervised learning to acquire error probabilities from the data that it is given. As PCR protocols vary in their choice of polymerase and number of rounds, these parameters vary by data set, perhaps greatly. This makes the universality of DADA’s approach especially attractive, and will be important as new sequencing methods come into use such as longer readlength and pairedend Illumina that commonly make substitution as well as indel errors [21]. While it now relies on training data to establish error parameters, AmpliconNoise could be embedded in the same procedure of estimating error probabilities after each successive round of clustering, but this would multiply the computation requirements by a factor of the number of rounds of reestimation, compounding the problem of its slower speed.
Conclusions
OTUs serve as a rough analogue for microbes of the more clearly defined taxonomic groups of higher organisms. However, the repurposing of the OTU concept to the problem of inferring sample genotypes from errorprone metagenomic sequence data has serious and inherent shortcomings. The absence of an error model causes estimates of diversity, especially species richness, to depend strongly on experimental variables such as the size of the data set, the length of the region sequenced, and the details of the PCR/sequencing chemistry. These shortcomings are not amenable to simple fixes; it is not possible to separate real diversity from errors using an OTU approach when the diversity and the errors exist at similar scales (as measured by Hamming distance), as is the case in many metagenomic studies. PyroNoise and AmpliconNoise have demonstrated the usefulness of denoising sequence data with statistical, physicallybased error models. These methods are based on the classical statistical technique of expectationmaximization. We have presented an alternate approach, DADA, which is more targeted to the particular task of producing conservative estimates of diversity from noisy sequence data. It is much faster and more capable of resolving finescale diversity while maintaining a lower false positive rate.
We did not achieve our goal of complete freedom from ad hoc parameters in this work. Even though Ω_{a}, our input parameter, has a simple probabilistic meaning that is data set independent, there are corrections to our PCR model, and as a result Ω_{a} takes on an ad hoc quality in this analysis. Nonetheless, Ω_{a}can be coarsely tuned from the data itself in the way shown. Alternatively, for conservative diversity estimates, Ω_{a} may be set to very small values (such as 10^{−100}), and the resolution of the algorithm may be directly quantified. DADA not only guesses what is there, but knows what would have been missed if it were present, making Ω_{a} ad hoc but not arbitrary.
Much work remains to be done, and it is not yet clear how the algorithms will fare with extremely rich finescale diversity as occurs for the antibody repertoire of Bcells and Tcells of the human immune system [22,23]. DADA must be equipped with statistics that correctly describe the abundance distribution of sequencing errors when a realistic model of PCR is used in which some reads are the result of shared lineages. More sophisticated methods for chimera detection that explicitly parameterize the chimera formation process analogously to the substitution and indel processes are also needed. Finally, these methods must be fully adapted and tested on sequencing platforms other than Roche’s 454.
Methods
General notation
From a sequencing data set
Treatment of insertions and deletions: the construction of indel families
In addition to substitution errors, reads acquire insertions and deletions (indels)
during amplification and sequencing. Both substitutions and indels could be used to
parameterize an error model, but here we focus on substitutions and do not attempt
to characterize the statistics of indels. Instead, we collapse together all the reads
of sequences within each B_{α}that differ from each other only by the location of indels in their alignments to
G_{α}, forming subsets of each B_{α} that we call indel families. We call the indel families
Alignments between sequences and each G_{α} in this paper took place with a scoring matrix of 5 + logT(to make them comparable with NCBI BLAST’s NUC.4.4 matrix [24]), where T, introduced below, is a matrix of substitution error probabilities. We used a gap penalty of −4 and a hompolymer gap penalty of −1. The gap penalty had to be less than half the smallest mismatch score or alignments would favor a pair of indels to that mismatch. The worst mismatch score tended to be about −6, and so −4 was chosen as a gap penalty to allow as many gaps as possible without making any mismatches prohibited within alignments.
The independence between substitution errors on different reads implies a binomial distribution for the number of reads of each family
If the occurrence of substitution errors on different reads are independent events, then each read of genotype G_{α} has an i.i.d. onetrial multinomial distribution with parameters Λ={λ_{yα}}, which we call the genotype error probabilities, to belong to each indel family s_{y}. Λ also parameterizes the probability distribution for R_{y}, the number of reads of family y: if s_{y}⊆B_{α}, then because R_{y} is the sum of ρ_{α} Bernoulli random variables each with success probability λ_{yα}, it follows the binomial distribution, R_{y}∼Bin(λ_{yα},ρ_{α}). The assumption of independence between reads does not hold if early round PCR errors may be sampled multiple times in the final sequence data. Then, if we condition on having observed a particular error on some other read, the probability to observe it additional times is increased.
Λ may be constructed from simple nucleotide transition matrices
If the occurrence of substitution errors on different sites of the same read are independent events that do not depend upon the absolute position of the sites, then we can write each λ_{yα} in terms of a homogeneous Markov chain T, whose elements we call nucleotide error probabilities. The simplest useful model of this sort is the 4×4 transition matrix T_{ij}=P(ji) with, for example P(CA) the probability for taking nucleotide A in the sample to C in the data (i,j will always index nucleotides), and this is the model used by AmpliconNoise. These probabilities generate the genotype error probabilities Λ via
where α_{n} and y_{n} denote the n^{th} nucleotides of G_{α} and s_{y} (also let λ_{xα}=λ_{yα} for all xs_{x}∈s_{y}, used in Algorithm Algorithm 1).
If the nucleotide error probabilities at each site can depend upon the nearestneighbor
flanking nucleotides, we can keep track of a transition matrix T^{(L,R)} for each possible (L,R) pair of flanking nucleotides such that
Assessing fit with an error model via tail probabilities
In order to assess whether
p_{y}: the abundance pvalue
Call
Given the definition of p_{y}above,
We refer to this as the abundance pvalue because it evaluates the probability of having observed more extreme abundances under
the null hypothesis that each r_{y}was generated by the error model. Because one abundance pvalue is generated for each
indel family, we use a Bonferroni correction and compare each p_{y}with
If we had not conditioned on having observed at least one read of each family, then
the unobserved families would not have born any significance (they would all have p_{y}=1), but before looking at the data these families could have been significant. This would create a difficulty in choosing an appropriate
multiple hypothesis correction; a naive Bonferroni correction of
q_{α}: the read pvalue
For each cluster B_{α}, we compute the probability q_{α}, which we call the read pvalue, that there is at least one read with a genotype error probability at least as small
as
where e iterates over all
Vectors of λ_{γ} and m_{γ}(α) can be computed starting with γrepresenting the more common errors and extended to more rare errors as needed to
compute pvalues for smaller
Maximum likelihood estimate (mle) of error probabilities
After forming a partition
where α_{n} and y_{n} denote the n^{th} aligned nucleotides of G_{α}and s_{x}. For the case without contextdependence, the likelihood may be rewritten as
where N_{ij} is the total number of js in
where N_{i}=∑_{j}N_{ij} (the diagonal
where i≠j, L,R are any pair of left and right flanking nucleotides, N_{LijR} is the total number of LjR codons in
Algorithm for inferring the sample genotypes and error probabilities
The pvalues p_{y} and q_{α} are the basis for an algorithm that alternatively updates
Algorithm 1
DADA Sequence clustering algorithm [15]
T^{0}=
t←1
repeat
repeat
if
start a new cluster within
repeat
update
each s_{x}joins B_{α}where
until
update {p_{y}} and {q_{α}}
until
t←t + 1
untilT has converged
There are three levels of nesting, each beginning with a repeat statement in Algorithm Algorithm 1. From outer to inner, we give a qualitative description of their purpose:
1. Starting with T^{0}, the maximum likelihood nucleotide error probabilities given the trivial partition
2. For each T^{t}, the next loop begins with the trivial partition,
3. After adding a new block
Appendix 1: chimeras, contaminants, and missing or incorrect Sanger sequences
There are disagreements between the Sanger sequences of the clonal isolates used to construct the data sets and the denoised sequences of DADA and AmpliconNoise that are due to sources other than PCR substitutions and pyrosequencing errors. These include contamination, chimeric sequences that result from the coamplification of genomes with regions over which they exactly match, Sanger errors, the absence of any reads of two sample genotypes, and disagreements between the Sanger sequences and majority of the 454 reads about the lengths of several homopolymers (for example, not a single 454 read matched four of the Sanger sequences while many were identical except for the presence of a single deletion on a G homopolymer). In order to evaluate the relative performance of DADA and AmpliconNoise as denoising algorithms, it was necessary to identify which disagreements between Sanger and denoised sequences were due to these sources and which, falling outside these categories, were due to algorithmic shortcomings. We chose criteria for classifying errors of these types and applied them to the sequences denoised by DADA and AmpliconNoise (Table 3).
Table 3. Additional sources of noise and false positives
We began by correcting possible errors in the Sanger sequences. In the Divergent and Artificial data sets there were disagreements between very high abundance denoised sequences and their nearest neighbor Sanger sequences (12/23 Divergent Sanger sequences and 63/90 Artificial Sanger sequences). The denoised sequences were a consensus of thousands of pyrosequencing reads and did not differ from the Sanger sequences near homopolymers; rather, all disagreements were nonhomopolymer related deletions near the starts of the reads. It was confirmed (Chris Quince, personal communication) that all bases of Sanger sequences aligning to sites within 13 nucleotides (nts) of the forward primer of the pyrosequencing reads had been removed in Q11, and so were likewise removed in all our analysis. In the Titanium data, the denoised sequences closest to eight of the Sanger sequences had over 100 reads but differed from them by one or two homopolymer deletions at several long homopolymers. DADA and AmpliconNoise (s10 and s25) agreed on the presence of the deletions in all of these sample genotypes, and there were more copies of the errorcontaining sequences than the Sanger sequences in the raw data (Table 4), suggesting either an error probability greater than 50% for the combined amplification/pyrosequencing process or problems with the Sanger sequences. Therefore, we did not consider these disagreements to be false positives or false negatives for either algorithm.
Table 4. Titanium genotypes with more errorcontaining than Sanger sequence matching reads
Next we identified chimeras: sequences consisting of two sections with one section a close match to one sample genotype and the other a close match to a second sample genotype. These can be produced in substantial quantities by PCR [25]. Analogously to Q11, for each denoised sequence we computed the Hamming distance to the nearest Sanger sequence and to the nearest exact chimera by considering all possible breakpoints between all pairs of sequences of higher abundance (a chimera will have fewer reads than its parents unless it acquires substantial PCR bias). For a denoised sequence to be classified as a chimera, we required that it be at least 3 nts closer to the nearest exact chimera than the nearest sample genotype and within 5 nts of the optimal chimera (also analogous to the procedure used in Q11). We waived the 3 nt improvement criteria for denoised sequences that were identical to exact chimeras, which occurred for some particularly highly abundant chimeras between closely related sample genotypes. All data sets had a large number of chimeras amongst their denoised reads, with Titanium having more chimeras than sample genotypes (both algorithms), highlighting how essential accurate chimera identification is in tandem with the correction of PCR and sequencing errors.
Finally, we found several sequences too far from any sample genotypes or exact chimeras to be explained by being errors away from either. Some of these sequences were similar to previously observed sequences found on GenBank (Table 5). We classified as a contaminant any sequence within 2 nts of a GenBank sequence and at least 5 nts closer to a GenBank sequence than any sample genotype or chimera. We found a mixture of contaminants likely to come from the original sample (lake water bacteria), and contaminants that may have entered the sample during processing and sequencing (bacteria previously observed in human skin, a human mouth, and soil samples). These contaminants were not previously mentioned in Q11 but were straightforward to detect when looking at DADA’s denoised sequences, in part because having a smaller pool of algorithmic false positives makes identifying contaminants much easier.
Table 5. Contaminants
In classifying false negatives, we sought to evaluate the ability of the algorithms to detect the presence of genuine diversity in the pyrosequencing reads. However, not all clones used to construct the samples in Q11 had exact matches amongst the pyrosequencing reads: one Sanger sequence in the Artificial data set (#69 in Q11) was 29 nts away from the nearest 454 read and one Sanger sequence in the Titanium data set (#66 in Q11) was 61 nts away from the nearest 454 read. We assumed that these were missing from the 454 data, and they do not contribute to the false negatives of either algorithm. Further, a number of clones were identical to each other up to the point of truncation of the pyrosequencing reads. Finally, a number of the Titanium clones differed from each other only by the presence of Ns, bases that Sanger was unable to resolve. In such cases, we collapsed clones together and assumed the nonN containing Sanger sequence was correct. Table 6 gives the number of distinct (up to Ns) clones that are present in the data, and how many had been used to construct the sample.
Table 6. Detectable genotypes
Several aspects of this postprocessing pipeline – especially contaminant identification– utilize knowledge of the sample genotypes and do not constitute useful methods for nontest data. Our approach to chimera identification does not utilize sample genotype information, but requires more development to be applied to nontest data: it does not search for higherorder chimeras that are combinations of three or more parental sequences, the criteria for being labelled as a chimera do not scale with error rates and read lengths, and no attempt has been made to realistically model the chimera formation process. Our goal has been only the isolation of errors due to PCR and sequencing error.
Appendix 2: Ω_{a}robustness
To assess whether DADA’s performance in this paper was the result of the finetuning of Ω_{a}, we evaluated each data set under all three Ω_{a}values. The results are given in Table 7 and demonstrate that the same results would have been achieved by using Ω_{a} = 10^{−100} for all three data sets.
Table 7. False positives and false negatives for each data set with Ω_{a} = {10^{−15},10^{−40},10^{−100}} and Ω_{r} = 10^{−3}
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
MJR designed the algorithm, wrote the software, performed the analyses, and wrote the paper. BJC, DSF, and SPH provided assistance and advice with algorithm design, comparative analysis, and the paper. All authors read and approved the final manuscript.
Acknowledgements
Thanks to Chris Quince and Sue Huse for providing and helping with clarifications about test data. Thanks also to Yijun Sun for help with the ESPRIT clustering algorithm. Finally, thanks again to Chris Quince and to Kabir Peay for useful comments on the manuscript. This work was supported in part by the NSF under grant DMS1120699. SH is supported by grant NIHR01GM086884.
References

Cheung MK, Au CH, Chu KH, Kwan HS, Wong CK: Composition and genetic diversity of picoeukaryotes in subtropic coastal waters as revealed by 454 pyrosequencing.
ISME J 2010, 4:10531059. PubMed Abstract  Publisher Full Text

Iwai S, Chai B, Sul WJ, Cole JR, Hashsham SA, Tiedje JM: Genetargetedmetagenomics reveals extensive diversity of aromatic dioxygenase genes in the environment.
ISME J 2010, 4:279285. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Teixeria LCRS, Peixoto RS, Cury JC, Sul WJ, Pellizari VH, Tiedje J, Rosado AS: Bacterial diversity in rhizosphere soil from Antarctic vascular plants of Admiralty Bay, maritime Antarctica.
ISME J 2010, 4:9891001. PubMed Abstract  Publisher Full Text

Huse SM, Dethlefsen L, Huber JA, Welch DM, Relman DA, Sogin ML: Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing.
PLoS Genet 2008, 4:e1000255. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wilmes P, Simmons SL, Denef VJ, Banfield JF: The dynamic genetic repertoire of microbial communities.
FEMS Microbiol Rev 2009, 33:109132. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Huber JA, Mark Welch DB, Morrison HG, Huse SM, Neal PR, Butterfield DA, Sogin ML: Microbial population structures in the deep marine biosphere.
Science 2007, 318:97100. PubMed Abstract  Publisher Full Text

Turnbaugh PJ, Quince C, Faith JJ, McHardy AC, Yatsunenko T, Niazi F, Affourtit J, Egholm M, Henrissat B, Knight R, Gordon JI: Organismal, genetic, and transcriptional variation in the deeply sequenced gut microbiomes of identical twins.
PNAS 2010, 107:75037507. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, Neal PR, Arrieta JM, Herndl GJ: Microbial diversity in the deep sea and the underexplored rare biosphere.
PNAS 2006, 103:1211512120. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kunan V, Engelbrektson A, Ochman H, Hugenholtz P: Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates.
Environ Microbiol 2010, 12:118123. PubMed Abstract  Publisher Full Text

Zhou J, Wu L, Deng Y, Zhi X, Jiang Y, Tu Q, Xie J, Nostrand JDV, He Z, Yang Y: Reproducibility and quantitation of amplicon sequencingbased detection.
ISME J 2011, 5:13031313. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Huse SM, Welch DM, Morrison HG, Sogin ML: Ironing out the wrinkles in the rare biosphere through improved OTU clustering.
Environ Microbiol 2010, 12:18891898. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Reeder J, Knight R: Rapidly denoising pyrosequencing amplicon reads by exploiting rankabundance distributions.
Nat methods 2010, 7:668669. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Quince C, Lanzen A, Curtis TP, Davenport RJ, Hall N, Head IA, Read LF, Sloan WT: Accurate determination of microbial diversity from 454 pyrosequencing data.
Nat methods 2009, 6:639641. PubMed Abstract  Publisher Full Text

Quince C, Lanzen A, Davenport RJ, Turnbaugh PJ: Removing noise from pyrosequenced amplicons.
BMC Bioinf 2011, 12:38. BioMed Central Full Text

2012.

Sun Y, Cai Y, Yu F, Farrell MF, McKendree W, Farmerie W: ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences.
Nucleic Acids Res 2009, 37:e76. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Huse SM, Huber JA, Morrison HG, Sogin ML, Welch DM: Accuracy and quality of massively parallel DNA pyrosequencing.
Genome Biol 2007, 8:R143. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Cai Y, Sun Y: ESPRITTree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time.
Nucleic Acids Res 2011, 39:e95. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequences of two proteins.
J Mol Biol 1970, 48:443453. PubMed Abstract  Publisher Full Text

Fraley C, Raftery AE: How many clusters? which clustering method? answers via modelbased cluster analysis.
Comput J 1998, 41:578588. Publisher Full Text

Yang X, Aluru S, Dorman KS: Repeataware modeling and correction of short read errors.
BMC Bioinf 2011, 12:S52. BioMed Central Full Text

Boyd SD, Marshall EL, Merker JD, Maniar JM, Zhang LN, Sahaf B, Jones CD, Simen BB, Hanczaruk B, Nguyen KD, Nadeau KC, Egholm M, Miklos DB, Zehnder JL, Fire AZ: Measurement and clinical monitoring of human lymphocyte clonality by massively parallel VDJ pyrosequencing.
Sci Translational Med 2009, 1:12ra23. Publisher Full Text

Wang C, Sanders CM, Yang Q, Schroeder HW, Wang E, Babrzadeh F, Gharizadeh B, Myers RM, Hudson JR, Davis RW, Han J: High throughput sequencing reveals a complex pattern of dynamic interrelationships among human T cell subsets.

Todd Lowe: NUC.4.4 score matrix, NCBI.
1992.

Lahr DJG, Katz LA: Reducing the impact of PCRmediated recombination in molecular evolution and environmental studies using a newgeneration highfidelity DNA polymerase.
BioTechniques 2009, 47:857866. PubMed Abstract  Publisher Full Text