Discrimination plots for a typical cluster in the Artificial data set with 4691 reads. (a) simulated errors drawn from the error model and (b) the real errors in the cluster. Sequences (diamonds) are characterized by abundance and the probability λper read of having been produced. On the x-axis, we plot logλscaled by the most common error probability, TA→G, so that values can be interpreted as an effective Hamming distance. The dashed lines delineate the region – the lower left quadrant – where, for significance thresholds Ωaand Ωr provided by the user, DADA accepts that a sequence could have arisen via the error model. The vertical dashed lines shows the λbelow which (or the effective distance above which) the read p-value rejects sequences as being errors, and the curved dashed line shows the abundances above which the abundance p-value rejects sequences as being errors for each value of λ. There are several sequences in the real data (red diamonds) that would be rejected by the abundance p-value at the Ωa = .01 significance level; we posit that early round PCR effects are a suitable candidate to explain these departures from the error model.
Rosen et al. BMC Bioinformatics 2012 13:283 doi:10.1186/1471-2105-13-283