Department of Applied Physics, Stanford University, CA, US

Departments of Biology and Bioengineering, Stanford University, CA, US

Department of Statistics, Stanford University, CA, US

Abstract

Background

PCR amplification and high-throughput sequencing theoretically enable the characterization of the finest-scale diversity in natural microbial and viral populations, but each of these methods introduces random errors that are difficult to distinguish from genuine biological diversity. Several approaches have been proposed to denoise these data but lack either speed or accuracy.

Results

We introduce a new denoising algorithm that we call DADA.

Conclusions

Background

The potential of high-throughput sequencing as a tool for exploring biological diversity is great, but so too are the challenges that arise in its analysis. These technologies have made possible the characterization of very rare genotypes in heterogeneous populations of DNA at low cost. But when applied to a metagenomic sample, the resulting raw data consist of an unknown mixture of genotypes that are convolved with errors introduced during amplification and sequencing.

There are two broad approaches to high-throughput sequencing of metagenomes: in shotgun sequencing, reads are drawn from across the genomes present in the sample, whereas in amplicon sequencing a single locus is PCR-amplified and sequenced deeply.

By trading off a broad survey of gene content for greater sequencing depth at the sampled loci, amplicon sequencing has the potential to detect the rarest members of the sampled community, but errors interfere more profoundly. Unlike genome assembly projects, where one needs only to determine the consensus base at each locus or decide whether a SNP is present in a population, the space of possible distributions for the sample genotypes and frequencies is effectively infinite. As a result, ambiguities in genome projects can usually be resolved by increasing the amount of data, whereas increasing depth (as much as 10^{6} reads in recent studies) only increases the number of erroneous sequences observed.

The analysis of amplicon sequence data typically begins with the construction of OTUs (operational taxonomic units), clusters of sequences that are within a cutoff in Hamming distance from one another. OTUs serve to collapse the complete set of sequences into a smaller collection of representative sequences – one for each OTU – and corresponding abundances based on the number of reads falling within each cluster. OTUs were developed as a tool for classifying microbial species, but have also been repurposed to the task of correcting errors; the sequences within an OTU are typically interpreted as a taxonomic grouping without specifying whether the variation within an OTU represents errors or real diversity on a finer scale than that chosen to define the OTU. If the scale of the noise is smaller than that of the clusters, then the construction of OTUs will appropriately group error-containing sequences together with their true genotype. However, as sequencing depth increases, low probability errors outside the OTU radius will start to appear, and will be incorrectly assigned to their own OTU. Early studies using this approach on high-throughput metagenome data sets reported large numbers of low-abundance, previously unobserved genotypes that were collectively dubbed the "rare biosphere".
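To make the depth effect concrete, here is a back-of-the-envelope sketch. The 0.5% per-base error rate, 100-base reads, and 3-error OTU radius are illustrative assumptions, not values from any particular study: the chance that one read escapes the radius is small, but the expected number of escapees grows linearly with depth.

```python
from math import comb

def p_more_than(k, n, e):
    """P(#errors > k) for n independent sites with per-site error rate e."""
    return 1.0 - sum(comb(n, i) * e**i * (1 - e)**(n - i) for i in range(k + 1))

# chance that a single 100-base read carries more than 3 errors
# (i.e. falls outside a 3% OTU radius) at an assumed 0.5% error rate
p = p_more_than(3, 100, 0.005)

# expected number of such reads at two sequencing depths
expected = {depth: depth * p for depth in (10**3, 10**6)}
```

At a depth of 10^3 essentially no reads escape, while at 10^6 thousands do, each seeding a spurious OTU.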

In response, a variety of approaches to disentangling errors from actual genetic variation have been proposed recently.

We believe that the way forward is to model the error process and evaluate the validity of individual sequences in the context of the full metagenomic data set, crucially including the abundances (number of reads) corresponding to each sequence. Major progress in this direction has been made recently by the developers of AmpliconNoise.

We build on the error-modeling approach pioneered in that work.

Results

Model and algorithm

We introduce a first-order model of the error process by assuming (1) that each sequence read originates from a distinct DNA molecule in the sample, and therefore that the presence of errors on different reads are independent events, and (2) that substitution errors at different sites of the same read occur independently, with probabilities that do not depend on position.

Under these conditions, the numbers of reads (abundances) of the error-containing sequences derived from a sample genotype follow the multinomial distribution, and the abundance of each individual error sequence follows a binomial distribution (see Methods).

These statistics serve as the basis of a sequence-clustering algorithm in which (1) reads are assigned to clusters, (2) a putative sample genotype is inferred for each cluster, (3) reads are reassigned to the cluster for which they are most likely to have resulted as errors from the inferred sample genotype, (4) the two p-value statistics are computed given the inferred sample genotypes and the clustering of the sequences, and (5) additional clusters are created if the clustering is statistically inconsistent with the error model (as suggested by small p-values).
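The five steps above can be sketched in miniature. Everything in this sketch is an illustrative simplification and not DADA's implementation: a uniform per-site error rate stands in for the fitted error model, and a crude abundance-weighted log-probability threshold stands in for the two p-value statistics.

```python
import math

ERR = 0.01  # assumed uniform substitution probability per site

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def log_prob(read, genotype):
    """Log-probability that a single read of `genotype` appears as `read`."""
    d = hamming(read, genotype)
    return d * math.log(ERR / 3) + (len(read) - d) * math.log(1 - ERR)

def cluster(counts, threshold=-200.0):
    """Alternate (a) splitting off the sequence least consistent with the
    current genotypes and (b) reassigning every sequence to its most
    likely genotype, until no sequence is surprising enough to split."""
    genotypes = [max(counts, key=counts.get)]       # start from one cluster
    while True:
        assign = {s: max(genotypes, key=lambda g: log_prob(s, g))
                  for s in counts}
        # abundance-weighted surprise of each sequence under its cluster
        surprise = {s: counts[s] * log_prob(s, assign[s]) for s in counts}
        worst = min(surprise, key=surprise.get)
        if surprise[worst] < threshold and worst not in genotypes:
            genotypes.append(worst)                 # step (5): new cluster
        else:
            return genotypes, assign

# toy reads: two genuine genotypes, each shadowed by a 1-error sequence
reads = {"ACGTACGT": 1000, "ACGTACGA": 12, "TTTTACGT": 500, "TTTAACGT": 6}
genotypes, assign = cluster(reads)
```

The abundant but distant sequence is split off as a second genotype, while the low-abundance 1-error sequences are absorbed as errors of their neighbors.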

The full algorithm (Figure

**DADA schematic.** The basic structure of DADA.

The p-values

We introduce two statistics for deciding that particular sequences are unlikely to have arisen as errors. The first defines a genotype error probability threshold ρ^{∗}, below which reads are decided not to be errors, by means of the read p-value: the probability of having observed at least one read more unlikely than ρ^{∗}. The second is the abundance p-value: the probability that at least one sequence should have been as overabundant as the most overabundant sequence. The abundance p-value thus tests whether a sequence has too many reads to be explained by errors alone.

In the discrimination plots shown below, the Ω_{a} line represents a lower bound on the abundance at which a sequence is judged inconsistent with the error model.

Discrimination plots for a typical cluster

**Discrimination plots for a typical cluster with 4691 reads. (a)** simulated errors drawn from the error model and **(b)** the real errors in the cluster. Sequences (diamonds) are characterized by abundance and the probability of being generated as errors (dominated here by λ_{A→G}, so that values can be interpreted as an effective number of A→G errors). Discrimination lines depend on the Ω_{a} and Ω_{r} provided by the user, and are shown at the Ω_{a} = .01 significance level; we posit that early round PCR effects are a suitable candidate to explain the departures from the error model visible in **(b)**.

For both the real and simulated data, the abundance p-value does a good job of tracking the form of the abundances of the errors, and the read p-value sits to the right of all observed data. For the real data, a small number of errors sit on or above the abundance discrimination line. Such errors were individually not expected to be observed at all, but ended up with a small number of reads larger than one. This pattern was observed across many clusters, and we believe that it reveals the presence of small violations of our assumption of the independence between reads. In particular, in a regime where the ratio of the number of error-free reads to the number of DNA molecules in the sample that act as the basis for amplification is of order one or larger, errors during early stages of PCR may be sampled multiple times in the sequence data. As a result, the distribution for the number of reads of these errors may fall off much more slowly than what our model suggests. To deal with this effect in this paper, we lowered the Ω_{a} threshold using an ad hoc method (discussed below) to prevent excess false positives. Doing so did not affect the results for these data sets, but a lowered Ω_{a} could be limiting for samples with even finer-scale diversity. Further analytics that model PCR as a branching process improve this current ad hoc threshold (unpublished work).

Treatment of insertions and deletions

Indels are collapsed into indel families rather than modeled explicitly (see Methods). Treating indels in this way does not affect the accuracy of DADA on these data.

Preclustering

Prior to our probabilistic sequence clustering we divided the raw data into coarse 3% single-linkage clusters (with indels not contributing to distance): subsets in which each sequence is ≤3% distant from at least one other sequence in the same subset.
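A minimal sketch of this preclustering step, assuming plain Hamming distance on equal-length sequences (the real preclustering also discounts indels when measuring distance, which this sketch omits):

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def single_linkage(seqs, radius=0.03):
    """Single-linkage clusters: two sequences are joined whenever their
    Hamming distance is at most `radius` times the sequence length."""
    parent = list(range(len(seqs)))

    def find(i):                       # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if hamming(seqs[i], seqs[j]) <= radius * len(seqs[i]):
                parent[find(i)] = find(j)

    clusters = {}
    for i, s in enumerate(seqs):
        clusters.setdefault(find(i), []).append(s)
    return list(clusters.values())

seqs = ["ACGT" * 10,                 # 40 bases, so the 3% radius is 1.2
        "ACGT" * 9 + "ACGA",         # 1 substitution from the first
        "T" * 40]                    # far from both
clusters = single_linkage(seqs)
```

Because linkage is single, chains of nearby sequences merge transitively, which is why the preclusters are deliberately coarse.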

Clustering

Each precluster is partitioned into sets of sequences that are conjectured to contain all errors arising from different sample genotypes. This partition is initialized to a single cluster containing all sequences. Two procedures then alternate. First, the indel family most unlikely to have resulted from errors is split off into a new cluster. Sequences then move between clusters based on the probability that they were generated as errors by each one, and the consensus sequence for each cluster is updated until there are no remaining reassignments that can improve the probability of the data. This second step is analogous to the assignment and update steps of standard k-means clustering.

Accuracy

We evaluated the accuracy of DADA and AmpliconNoise on test data sets constructed from mixtures of clonal isolates.

All data sets had undergone filtering of reads deemed to be of low quality prior to application of the denoising algorithms.

Tuning algorithmic sensitivity

DADA requires two input parameters, Ω_{a} and Ω_{r}, the significance levels for its abundance and read p-values. Decisions about singletons, the sequences represented by a single read, depend on Ω_{r}, whereas decisions about sequences with several reads depend on Ω_{a}. The two values may be tuned independently to match the priority being placed on capturing the rarest and the more common diversity.

Due to the early-stage PCR effects discussed above, it was necessary to use Ω_{a} significance levels lower than typical values. In order to select such values, we first performed a loose clustering of each data set with larger values of Ω_{a} and Ω_{r}, and then made histograms of the Ω_{a} thresholds that would be required for each cluster to be reabsorbed into some other cluster (see the figure below). Clusters produced by departures from the error model persist down to very small values of the Ω_{a} threshold. Thus, we looked for the first large gap in these histograms that would suggest all such model departures had been captured. Such a gap occurs at Ω_{a}=10^{−15} for the Divergent data set, Ω_{a}=10^{−40} for the Artificial data set, and Ω_{a}=10^{−100} for the Titanium data set. We also clustered all data sets at Ω_{a}=10^{−100} and found that the results were unchanged (see Appendix 2: Ω_{a} robustness). This suggests that Ω_{a}=10^{−100} is a reasonable default value to use when clustering diversity at this scale, even though higher resolution may be achieved by the method outlined above. For non-test metagenome data that is more diverse and less oversampled, we have seen evidence that using much larger values of Ω_{a} (such as .01) may be possible without compromised accuracy, but in such cases it is always advisable to make histograms of the type above to ensure that there is not an excess of clusters that would vanish if Ω_{a} were lowered slightly.

Ad hoc Ω_{a} choices

**Ad hoc Ω_{a} choices for the Divergent (a) and (d), Artificial (b) and (e), and Titanium (c) and (f) data sets. (a)**-**(c)** are histograms of the Ω_{a} threshold at which each cluster derived from a run with Ω_{a}=Ω_{r}=10^{−3} rejoins some other nearby cluster. Genuine genotype counts are shown in blue and false positive counts are shown in red. The first gaps in these histograms were used to pick Ω_{a} thresholds for reclustering the data, and are indicated by vertical dashed lines. **(d)**-**(f)** show the Ω_{a} discrimination lines for the largest cluster in each data set (with 2294, 5479, and 1095 reads) for Ω_{a}=10^{−3} and the associated ad hoc Ω_{a} values.

We did not observe any significant departures in these data from our model that would affect the read p-values, and it was therefore possible to maintain the interpretation of Ω_{r} as a significance threshold. As a result, for these data, which contain ≤50 preclusters that were clustered separately by DADA, we chose Ω_{r}=10^{−3} so that the probability of having a false positive would be ≤5%.

False negatives and false positives

The purpose of DADA is to infer the genotypes present in the sample; we therefore count false positives (inferred genotypes not present in the sample) and false negatives (sample genotypes that were not recovered).

We present, in the table below, the false positives and false negatives of DADA and AmpliconNoise on each data set.

| **Sample** | **DADA False Pos** | **DADA False Neg** | **AmpliconNoise False Pos** | **AmpliconNoise False Neg** |
|---|---|---|---|---|
| Divergent | 0 | 0 | 2 | 0 |
| Artificial | 1 | 2 | 8 | 7 |
| Titanium (s10) | 6 | 0 | 8 | 9 |
| Titanium (s25) | 23 | 4 | — | — |

The differences in the nature of the false positives and negatives made by DADA and AmpliconNoise are characterized below.

**Nature of false positives and false negatives of DADA and AmpliconNoise on the test data sets.** False positives are characterized by the number of reads associated with the falsely inferred genotype.

Speed

We evaluated the speed of DADA and compared it with ESPRIT and AmpliconNoise.

| **Function** | **CPU time (seconds)** |
|---|---|
| **ESPRIT** | |
| kmerdist | 74.80 |
| needledist | 924.68 |
| Total | 1002.68 |
| **DADA** | |
| N-W alignments | 97.81 |
| read p-values | 58.84 |
| Total | 296.64 |
| **ESPRIT+DADA** | 1.30×10^{3} |
| **AmpliconNoise** | 7.57×10^{4} |

The CPU times for a few significant subroutines within ESPRIT and DADA, together with totals for each method.

As read lengths continue to grow, we expect the time complexity of DADA to be increasingly dominated by the Needleman-Wunsch alignment step.

PCR substitution probabilities: symmetries and nearest-neighbor context-dependence

Two paths to the same error

**Two paths to the same error.** Different mispaired bases (red) produce the same double-stranded product once paired with complementary bases (green), so that each path leads to an identical substitution in the sequenced product.

We also found that the nearest-neighbor nucleotide context affects the probability of substitution errors. We therefore introduced nearest-neighbor context-dependent transition matrices into the error model (see Methods).

Error probability symmetries

**Error probability symmetries for the Divergent (a) and (d), Artificial (b) and (e), and Titanium (c) and (f) data sets. (a)**-**(c)**: context-independent substitution error probabilities inferred by DADA. **(d)**-**(f)**: all 96 reverse-complementary pairs of context-dependent error probabilities inferred by DADA; the colors (red, cyan, green, black, blue, purple) distinguish the six reverse-complement-equivalent substitution types.

The magnitude of context-dependence for these data was moderate (most context-dependent probabilities differed by <50% from their context-independent counterparts).

We have worked with other data for which context-dependence is large and has a strong effect on clustering. Therefore, we leave use of context-dependence as an optional feature of DADA.

Discussion

However, full incorporation of abundance information makes DADA sensitive to departures from the error model that inflate the abundance of errors. At the nominal significance levels Ω_{a}=Ω_{r}=10^{−3}, these departures would produce an excess of false positives.

In lieu of an abundance statistic that appropriately compensates for this effect, we deal with this problem by lowering the sensitivity of the algorithm by tuning down Ω_{a}. Further, because the probability given to each sequence scales as the error probabilities to the power of the number of reads (see Methods), if certain error parameters are larger than estimated in certain contexts, then the statistical significance of an error with many reads can be substantially overestimated. This problem gets progressively worse for deeper data sets, as all one-away errors begin to take on many reads. In anticipation of this problem, we have introduced nearest-neighbor context-dependence of error rates (see Methods). These had no impact on the final clustering for the test data presented, but in other data sets with larger context-dependent effects, we found a reduction in diversity estimates when context-dependence was included (data not shown).

Finally,

Conclusions

OTUs serve as a rough analogue for microbes of the more clearly defined taxonomic groups of higher organisms. However, the repurposing of the OTU concept to the problem of inferring sample genotypes from error-prone metagenomic sequence data has serious and inherent shortcomings. The absence of an error model causes estimates of diversity, especially species richness, to depend strongly on experimental variables such as the size of the data set, the length of the region sequenced, and the details of the PCR/sequencing chemistry. These shortcomings are not amenable to simple fixes; it is not possible to separate real diversity from errors using an OTU approach when the diversity and the errors exist at similar scales (as measured by Hamming distance), as is the case in many metagenomic studies.

We did not achieve our goal of complete freedom from ad hoc parameters in this work. Even though Ω_{a}, our input parameter, has a simple probabilistic meaning that is data set independent, there are corrections to our PCR model, and as a result Ω_{a} takes on an ad hoc quality in this analysis. Nonetheless, Ω_{a} can be coarsely tuned from the data itself in the way shown. Alternatively, for conservative diversity estimates, Ω_{a} may be set to very small values (such as 10^{−100}), and the resolution of the algorithm may be directly quantified. This makes Ω_{a} ad hoc but not arbitrary.

Much work remains to be done, and it is not yet clear how the algorithms will fare with extremely rich fine-scale diversity as occurs for the antibody repertoire of B-cells and T-cells of the human immune system.

Methods

General notation

From a sequencing data set, let r_{x} denote the number of individual reads of each distinct sequence s_{x}. From the s_{x} and r_{x}, we would like to construct an estimate of the set of sample genotypes. Each inferred genotype g_{α} is associated with a cluster C_{α} of sequences, and we notate the number of reads assigned to C_{α} by r_{α}. Each sequence s_{x} can reside in only one C_{α}, and it is assumed that g_{α} is the source of all the s_{x} in C_{α}; this framework does not allow for multiple g_{α} to contribute reads to the same s_{x}. Allowing the latter is a possible refinement that we do not pursue here.

Treatment of insertions and deletions: the construction of indel families

In addition to substitution errors, reads acquire insertions and deletions (indels) during amplification and sequencing. Both substitutions and indels could be used to parameterize an error model, but here we focus on substitutions and do not attempt to characterize the statistics of indels. Instead, we collapse together all the reads of sequences within each C_{α} that differ from each other only by the location of indels in their alignments to g_{α}, forming subsets of each C_{α} that we call **indel families**. An indel family s_{y} refers either to a subset of some C_{α} or to the sequence identical with g_{α} except for the substitution errors of its constituents, and r_{y} is the number of reads in the family. The r_{y} of each indel family will be used to test whether its abundance was not too improbable under an error model of substitution errors.

Alignments between sequences and each g_{α} in this paper took place with a scoring matrix of 5 + log

The independence between substitution errors on different reads implies a binomial distribution for the number of reads of each family

If the occurrence of substitution errors on different reads are independent events, then each read of genotype g_{α} has an i.i.d. one-trial multinomial distribution over the indel families, with parameters Λ={λ_{yα}}, which we call the **genotype error probabilities** of the families s_{y}. Λ also parameterizes the probability distribution for r_{y}, the number of reads of family s_{y}. If s_{y}⊆C_{α}, then because r_{y} is the sum of r_{α} Bernoulli random variables each with success probability λ_{yα}, it follows the binomial distribution, r_{y}∼Binomial(λ_{yα}, r_{α}). The assumption of independence between reads does not hold if early round PCR errors may be sampled multiple times in the final sequence data: if we condition on having observed a particular error on some other read, the probability to observe it additional times is increased.

Λ may be constructed from simple nucleotide transition matrices

If the occurrence of substitution errors on different sites of the same read are independent events that do not depend upon the absolute position of the sites, then we can write each λ_{yα} in terms of a homogeneous Markov chain of nucleotide transition probabilities λ_{ij}, the probability that a genotype nucleotide i appears as j on a read:

λ_{yα} = ∏_{n} λ_{a_{n} b_{n}},

where a_{n} and b_{n} denote the n^{th} nucleotides of g_{α} and s_{y} (also let λ_{xα}=λ_{yα} for all s_{x}∈s_{y}, used in Algorithm 1).

If the nucleotide error probabilities at each site can depend upon the nearest-neighbor flanking nucleotides, we can keep track of a transition matrix λ^{(L,R)} for each possible (L,R) pair of flanking nucleotides.

Assessing fit with an error model via tail probabilities

In order to assess whether the observed reads are consistent with the error model, we compute two tail probabilities, p_{y} and p_{α}: p_{y} is the probability of having seen at least r_{y} reads of s_{y} given that we saw at least one, and p_{α} is the probability of having seen at least one read with a genotype error probability at least as small as the smallest genotype error probability of an observed indel family in C_{α}.

The abundance p-value p_{y}:

Call p_{y} the probability of having observed at least r_{y} reads of family s_{y} given that we observed at least one:

p_{y} = P(R ≥ r_{y} | R ≥ 1), where R ∼ Binomial(λ_{yα}, r_{α}).

Given the binomial distribution of r_{y} above,

p_{y} = [1 − P(R < r_{y})] / [1 − (1 − λ_{yα})^{r_{α}}].

We refer to this as the **abundance p-value** of s_{y}: a small value indicates that it is unlikely that r_{y} was generated by the error model. Because one abundance p-value is generated for each indel family, we use a Bonferroni correction and compare each p_{y} with a corrected threshold derived from Ω_{a}, where Ω_{a} is a joint significance threshold that is provided to DADA.

If we had not conditioned on having observed at least one read of each family, then the p-value of every observed family would be small, since a priori any particular family is unlikely to appear even once (r_{y}=1), but before looking at the data these families were not identified hypotheses. A correction over all possible families, whose number grows exponentially in the length of g_{α} and which treats all possible families as tested hypotheses, would deprive the p-value of any statistical power. Conditioning on r_{y}>0 and evaluating only the observed sequences avoids this complication. However, any family with r_{y}=1 obtains p_{y}=1 regardless of the smallness of λ_{yα}, which necessitates our second statistic, p_{α}.

The read p-value p_{α}:

For each cluster C_{α}, we compute a probability p_{α}, which we call the **read p-value**. Let ρ_{α} be a random variable representing the smallest genotype error probability when r_{α} reads of g_{α} are generated according to Λ, and let ρ^{∗} be the smallest genotype error probability among the observed indel families of C_{α}. Then

p_{α} = P(ρ_{α} ≤ ρ^{∗}) = 1 − (1 − ∑_{e: λ_{eα} ≤ ρ^{∗}} λ_{eα})^{r_{α}},

where the λ_{eα} are the genotype error probabilities of the sequences at least as unlikely as ρ^{∗}. Evaluating the sum in this form would be computationally wasteful; instead we iterate over sets of sequences that share the same types of substitution errors. We index these sets by 4×4 off-diagonal matrices Q whose entries Q_{i≠j} specify the number of i's on a genotype that appear as j's on the sequence. All sequences of type Q share a common genotype error probability λ_{Qα}, equal to λ_{0α} (the probability of an error-free read) multiplied by a factor determined by Q. The number of sequences of type Q, which we call the multiplicity of Q in g_{α}, depends only on Q and the base composition of g_{α}, and is computed by taking a product over multinomial coefficients. With these quantities, the sum defining p_{α} becomes a sum over types Q.

Vectors of λ_{Qα} and multiplicities are computed for a representative set of base compositions, and the p_{α} that would result from each of these is used to approximate the p_{α} that would result from the exact base composition of g_{α}. Because one p_{α} is generated for each C_{α}, the p_{α} are then compared with a Bonferroni-corrected threshold derived from Ω_{r}, where Ω_{r} is another joint significance threshold provided to DADA.

Maximum likelihood estimate (mle) of error probabilities

After forming a partition, the error probabilities are re-estimated by maximizing the likelihood of the data,

∏_{α} ∏_{s_{x}∈C_{α}} ∏_{n} λ_{a_{n} b_{n}}^{r_{x}},

where a_{n} and b_{n} denote the n^{th} aligned nucleotides of g_{α} and s_{x}. For the case without context-dependence, the likelihood may be rewritten as

∏_{i,j} λ_{ij}^{t_{ij}},

where t_{ij} is the total number of genotype nucleotides **i** read as **j**s in the partition, weighted by read abundance. Maximizing subject to ∑_{j} λ_{ij} = 1 gives

λ_{ij} = t_{ij}/t_{i},

where t_{i}=∑_{j} t_{ij} (the diagonal counts t_{ii} corresponding to correctly read bases). For the context-dependent case, the analogous estimate is

λ^{(L,R)}_{ij} = t_{LijR}/t_{LiR},

where t_{LijR} is the total number of genotype nucleotides **i** read as **j** with flanking nucleotides L and R, and t_{LiR} = ∑_{j} t_{LijR}.

Algorithm for inferring the sample genotypes and error probabilities

The p-values p_{y} and p_{α} are the basis for an algorithm that alternately updates the partition and the error probabilities Λ^{t}, in order to improve the likelihood of the data. This is similar to the alternating assignment and update structure of k-means clustering, performed with the current Λ^{t}. The algorithm requires two user inputs, Ω_{a} and Ω_{r}, which are the joint significance thresholds for the abundance and read p-values.

Algorithm 1

Λ^{0} = maximum likelihood error probabilities given the trivial partition

**repeat**

  **T** = trivial partition (all sequences in one cluster)

  **repeat**

    **if** some p_{y} or p_{α} rejects the error model at Ω_{a}, Ω_{r} **then**

      start a new cluster within **T**

    **repeat**

      update each genotype g_{α}

      each s_{x} joins the C_{α} where λ_{xα} is greatest

    **until** no sequence changes clusters

    update {p_{y}} and {p_{α}}

  **until** no p-value rejects the error model

  update Λ^{t+1} given **T**

**until** **T** has converged

There are three levels of nesting, each beginning with a **repeat** statement in Algorithm Algorithm 1. From outer to inner, we give a qualitative description of their purpose:

1. Starting with Λ^{0}, the maximum likelihood nucleotide error probabilities given the trivial partition, the outer loop alternates re-estimation of Λ with reclustering, iterating over t until the partition **T** converges.

2. For each Λ^{t}, the next loop begins with the trivial partition, and new clusters are added until the p-values {p_{y}} and {p_{α}} do not allow rejection of the error model at joint significance levels Ω_{a} and Ω_{r}.

3. After adding a new block, sequences move between clusters, and the genotype g_{α} is also updated if a cluster C_{α} has a new consensus sequence. This continues until sequences cease changing clusters.

Appendix 1: chimeras, contaminants, and missing or incorrect Sanger sequences

There are disagreements between the Sanger sequences of the clonal isolates used to construct the data sets and the denoised sequences of DADA and AmpliconNoise.

| **Sample** | **DADA Denoised** | **Clone** | **Chim** | **Contam** | **Other** | **AmpliconNoise Denoised** | **Clone** | **Chim** | **Contam** | **Other** |
|---|---|---|---|---|---|---|---|---|---|---|
| Divergent | 43 | 23 | 18 | 2 | 0 | 51 | 23 | 23 | 3 | 2 |
| Artificial | 65 | 50 | 14 | 0 | 1 | 73 | 44 | 21 | 0 | 8 |
| Titanium (s10) | 274 | 80 | 185 | 3 | 6 | 163 | 71 | 82 | 2 | 8 |
| Titanium (s25) | 304 | 76 | 203 | 2 | 23 | — | — | — | — | — |

For each data set and both algorithms: the total number of denoised sequences, the number that matched one of the Sanger sequenced clones, the number classified as chimeras, the number classified as contaminants, and all other false positives.

We began by correcting possible errors in the Sanger sequences. In the cases listed below, a pyrosequence near a Sanger sequence received far more reads than the Sanger sequence itself, suggesting an error in the Sanger sequence.

| **Reads of nearby pyrosequence** | **Reads of Sanger sequence** | **Errors** |
|---|---|---|
| 5 | 0 | |
| 70 | 0 | |
| 75 | 1 | |
| 21 | 18 | |
| 14 | 0 | |
| 80 | 0 | |
| 77 | 3 | |
| 80 | 1 | |

Read support for each nearby pyrosequence and the corresponding Sanger sequence.
Next we identified chimeras: sequences consisting of two sections with one section a close match to one sample genotype and the other a close match to a second sample genotype. These can be produced in substantial quantities by PCR.

Finally, we found several sequences too far from any sample genotypes or exact chimeras to be explained by being errors away from either. Some of these sequences were similar to previously observed sequences found on GenBank; we classify these as likely contaminants.

**Accession**

**Reads/Frequency**

**
D
_{
GB
}
**

**
D
_{
sample
}
**

**
D
_{
chim
}
**

**Source**

**Type**

**DADA**

**AN**

_{
GB
}, _{
sample
}and _{
chim
}are the Hamming distances to the given GenBank entry, the nearest sample genotype, and the optimal chimera for each putative contaminant denoised sequence. No entries are given for the

Divergent

FR697039

14/4×10^{−4}

0

11

10

Lake Water

Bacterium

Y

Y

EU633742

1/3×10^{−5}

1

9

9

Showerhead

Methylobacter

Y

Y

JF515955

1/3×10^{−5}

1

8

8

Soil

Nitrosomonadaceae

N

Y

Titanium

FJ004768

77/3×10^{−3}

2

39

27

Soil

Bacterium

Y

Y

JF190756

1/4×10^{−5}

1

40

27

Human Skin

Bacterium

Y

Y

JQ462329

2/8×10^{−5}

0

7

5

Human Mouth

Bacterium

Y

N

In classifying false negatives, we sought to evaluate the ability of the algorithms to detect the presence of genuine diversity in the pyrosequencing reads. However, not all clones used to construct the samples were present and distinct in the sequenced data.

| **Sample** | **Genotypes** | **Present and distinct** |
|---|---|---|
| Divergent | 23 | 23 |
| Artificial | 90 | 50 |
| Titanium | 91 | 80 |

The number of genotypes used to construct the sample and the number that were present and distinct and so could be detected by a denoising algorithm.

Several aspects of this

Appendix 2: Ω_{a} robustness

To assess whether our results depend sensitively on the choice of Ω_{a}, we evaluated each data set under all three Ω_{a} values. The results are given in the table below; in particular, the false positives and false negatives are essentially unchanged at Ω_{a} = 10^{−100} for all three data sets.

| **Ω_{a}** | **Divergent False Pos** | **Divergent False Neg** | **Artificial False Pos** | **Artificial False Neg** | **Titanium False Pos** | **Titanium False Neg** |
|---|---|---|---|---|---|---|
| 10^{−15} | 0 | 0 | 10 | 2 | 7 | 0 |
| 10^{−40} | 0 | 0 | 1 | 2 | 6 | 0 |
| 10^{−100} | 0 | 0 | 1 | 2 | 6 | 0 |

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

MJR designed the algorithm, wrote the software, performed the analyses, and wrote the paper. BJC, DSF, and SPH provided assistance and advice with algorithm design, comparative analysis, and the paper. All authors read and approved the final manuscript.

Acknowledgements

Thanks to Chris Quince and Sue Huse for providing and helping with clarifications about test data. Thanks also to Yijun Sun for help with the ESPRIT software.