Aix-Marseille Université, CNRS, IRD, UMR 6116 - IMEP, Equipe Evolution Génome Environnement, Centre Saint-Charles, Case 36, 3 place Victor Hugo, 13331 Marseille Cedex 3, France

Genoscreen, Genomic Platform and R&D, Campus de l'Institut Pasteur, 1 rue du Professeur Calmette, Bâtiment Guérin, 4ème étage, 59000 Lille, France

Institut National de la Recherche Agronomique, UMR 1301, Equipe BPI, 400 route des Chappes, BP 167, 06903 Sophia-Antipolis Cedex, France

UMR CBGP (INRA/IRD/Cirad/Montpellier SupAgro), Campus international de Baillarguet, CS 30016, F-34988 Montferrier-sur-Lez cedex, France

Abstract

Background

The rapid evolution of 454 GS-FLX sequencing technology has not been accompanied by a reassessment of the quality and accuracy of the sequences obtained. Current strategies for decision-making and error-correction are based on an initial analysis by Huse

Results

We obtained a mean error rate for 454 sequences of 1.07%. More importantly, the error rate is not randomly distributed; it occasionally rose to more than 50% in certain positions, and its distribution was linked to several experimental variables. The main factors related to error are the presence of homopolymers, position in the sequence, size of the sequence and spatial localization in PT plates for insertion and deletion errors. These factors can be described by considering seven variables. No single variable can account for the error rate distribution, but most of the variation is explained by the combination of all seven variables.

Conclusions

The pattern identified here calls for the use of internal controls and error-correcting base callers, to correct for errors, when available (e.g. when sequencing amplicons). For shotgun libraries, the use of both sequencing primers and deep coverage, combined with the use of random sequencing primer sites should partly compensate for even high error rates, although it may prove more difficult than previously thought to distinguish between low-frequency alleles and errors.

Background

Scientific strategies and approaches based on next-generation sequencing (NGS) have been revolutionizing genetics over the last few years. Many aspects of basic, applied and clinical research now rely on the generation of enormous amounts of sequence data from various sample sources, to assess polymorphism (mostly SNPs), or expression data (RNA-Seq) at the genome level ^{© }technology (Illumina, Inc.) and SOLiD™ systems (Applied Biosystems™) offer a number of complementary solutions for specific requirements (see Metzker

One of the basic questions arising from this spectacular increase in sequence volume concerns the possible detrimental effects of this shift in quantity on the quality of the obtained data. In other words, is there a tradeoff between the quantity and quality of information? It is widely accepted that next-generation sequencing approaches generate such large amounts of sequence data that even if overall accuracy (derived from error rate) or quality (percentage of error-free sequences) is suboptimal it is still possible to reconstruct polymorphism rigorously by comparing redundant sequences that cover the same genomic region multiple times (i.e. depth of coverage provides accuracy, not the individual read)

In 2007, S. Huse and collaborators raised the question of the accuracy and quality of massively parallel pyrosequencing GS20 systems, performing an empirical analysis of the per-base error rate

Furthermore, in addition to estimating the per-base error rate, we aimed to identify the potential causes of sequencing errors and possible solutions for improving both the accuracy and quality of pyrosequences. We selected several variables likely to affect sequencing errors directly or indirectly: (i) the position of the nucleotide base within the sequence (the beginning of the sequence may be more accurate than the end), (ii) the primary structure of the sequence, including, in particular, the presence of homopolymers, (iii) the length of the sequence generated (a sequence may be short due to quality filtering, resulting from an accumulation of errors or the stochastic ending of polymerization), and (iv) the position of the bead carrying the sequence both within and between the regions on a PT plate (PicoTiterPlate) (edge effect), and between multiple PT plates. Our analyses are based on Roche test fragments. These are sequences used for GS-FLX Titanium diagnostics that are included in all runs, but not subjected to PCR amplification before sequencing. Thus with these fragments we estimate the sequencing error due to pyrosequencing. Huse et al.

Results and Discussion

Accuracy and quality of sequences

We assessed the quality of the sequences obtained by 454 GS-FLX Titanium sequencing, using the control DNA fragment Type I sequences (provided with 454 sequencing kits) as reference templates (see Materials and Methods for details). As these internal controls are added to the pyrosequencing process during the sequencing step, they are modified only by sequencing errors and are not related to any previous step. The quality of these control sequences is not influenced by the samples themselves, particularly with Titanium technology, in which loading beads are isolated from each other and there should therefore be no interference from adjacent beads. We analyze here the 86,237 sequences that passed the quality filters, representing the six control DNA fragments from three 454 GS-FLX runs. These results revealed several general trends in the sequencing error generated by 454 GS-FLX Titanium technology (Table

Comparative analysis of the accuracy and quality of sequences

| Sequence set (length) | # of sequences | % of error-free sequences | # of positions | Insertions | Deletions | Mismatch | Ambiguous | Total % of error |
|---|---|---|---|---|---|---|---|---|
| GS20 (101) | 34015 | 82.00% | 32801429 | 0.18% | 0.13% | 0.08% | 0.10% | 0.49% |
| Ref 1 (101) | 16052 | 87.12% | 1605640 | 0.15% | 0.05% | 0.01% | 0.01% | 0.22% |
| Ref 2 (101) | 16466 | 60.01% | 1600327 | 0.42% | 0.23% | 0.04% | 0.01% | 0.70% |
| Ref 3 (101) | 12215 | 72.96% | 1228804 | 0.17% | 0.19% | 0.01% | 0.01% | 0.38% |
| Ref 4 (101) | 9908 | 56.43% | 984452 | 0.30% | 0.37% | 0.03% | 0.00% | 0.70% |
| Ref 5 (101) | 15880 | 50.93% | 1595718 | 0.34% | 0.48% | 0.05% | 0.01% | 0.88% |
| Ref 6 (101) | 15716 | 75.17% | 1581075 | 0.25% | 0.10% | 0.00% | 0.01% | 0.36% |
| Total | 86237 | 67.57% | 8596016 | 0.27% | 0.23% | 0.02% | 0.01% | 0.53% |
| Ref 1 (572) | 16052 | 6.75% | 5359696 | 0.52% | 0.46% | 0.10% | 0.12% | 1.20% |
| Ref 2 (552) | 16466 | 9.75% | 4789285 | 0.89% | 0.28% | 0.10% | 0.08% | 1.35% |
| Ref 3 (500) | 12215 | 18.75% | 4180478 | 0.30% | 0.35% | 0.07% | 0.12% | 0.84% |
| Ref 4 (532) | 9908 | 6.88% | 2572843 | 0.56% | 0.71% | 0.19% | 0.11% | 1.57% |
| Ref 5 (592) | 15880 | 7.46% | 6171098 | 0.38% | 0.38% | 0.06% | 0.07% | 0.89% |
| Ref 6 (516) | 15716 | 11.81% | 6027338 | 0.60% | 0.17% | 0.07% | 0.04% | 0.88% |
| Total | 86237 | 10.09% | 29100738 | 0.54% | 0.36% | 0.09% | 0.09% | 1.07% |

The different types of error are detailed for each reference sequence for 454 sequencing. Errors are classified according to the nomenclature used by Huse et al. (2007): insertions, deletions, mismatches and ambiguous base calls (see materials and methods). Error rates are given for two length categories (first 101 bases vs. full length).

The error rate for the first 101 sequenced positions (corresponding to 8,596,016 examined bases) had a mean of 0.534% (95% CI: [0.529, 0.539]) (45,895 erroneous bases) for 454 GS-FLX Titanium data. This global error rate is five times higher than the error rate obtained by the analyses of GS20 test fragments and is similar to that obtained for GS20 experimental sequences. Indeed, 0.49% of the positions were erroneous for a comparable dataset relating to 101 positions (Table _{1/2 }= 0.215), followed by deletions (0.232% [0.229, 0.235]; q_{1/2 }= 0.170), mismatches (0.022% [0.021, 0.023]; q_{1/2 }= 0.010), and ambiguous base calls (0.007% [0.006, 0.007]; q_{1/2 }= 0.010). This pattern is entirely consistent with that described by Huse

If we restricted the analysis to full-length sequences (500 to 592 positions), we found for the 86,237 sequences that passed the 454 quality filters (29,100,738 bases) that 312,351 bases were erroneous (1.073% [1.069, 1.077]). The pattern observed for the first 101 positions was confirmed for the full-length sequence data, with insertions (0.541% [0.538, 0.543]; q_{1/2 }= 0.465) and deletions (0.359% [0.357, 0.362]; q_{1/2 }= 0.350) being the most common types of error and mismatches (0.088% [0.087, 0.089]; q_{1/2 }= 0.085) and ambiguous base calls (0.085% [0.084, 0.086]; q_{1/2 }= 0.090) making a smaller contribution to global error rate. Only 8,702 of the 86,237 full-length sequences (10.09% [9.89, 10.29]) had no error with respect to the corresponding reference sequence. This result strongly contrasts with the higher proportion of error-free sequences for the first 101 bases.
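The global error rates and their confidence intervals can be approximately reproduced from the counts quoted in the text. The sketch below assumes a Wald (normal approximation) interval for a binomial proportion; the text does not state which interval method was used, so the function name `error_rate_with_ci` and the choice of method are illustrative assumptions, and the last digit may differ from the published bounds.

```python
import math

def error_rate_with_ci(n_errors, n_positions, z=1.96):
    """Return (rate, lower, upper) in percent.

    Assumption: a Wald (normal approximation) interval; the interval
    method actually used in the study is not stated in the text.
    """
    p = n_errors / n_positions
    half_width = z * math.sqrt(p * (1.0 - p) / n_positions)
    return 100 * p, 100 * (p - half_width), 100 * (p + half_width)

# Counts quoted in the text: erroneous bases / examined bases.
first_101 = error_rate_with_ci(45_895, 8_596_016)
full_length = error_rate_with_ci(312_351, 29_100_738)

print("first 101 positions: %.3f%% [%.3f, %.3f]" % first_101)
print("full length:         %.3f%% [%.3f, %.3f]" % full_length)
```

With several million positions per estimate, the intervals are narrow, which is why even small differences between error types are resolved.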

The comparison of error rates between sequences of different lengths (first 101 positions

However, the consequences of this may be relatively minor even if most sequences display errors (89.91% [89.71, 90.11]), as the overall error rate is low, with only 1.07% of bases being problematic. It is widely believed that deep sequencing coverage (multiple independent sequences for the same locus) should make it possible to correct for errors in this context

**Number of sequences to correct erroneous positions**. 1a: this file illustrates the number of sequences necessary to obtain a majority of correct sequences. The x-axis shows the error rate and the y-axis shows the number of sequences needed, according to three possible probabilities: 0.001, 0.01 and 0.05. 1b: the x-axis shows the error rate for a given position (ranging from 0 to 0.5); the y-axis shows the cumulative proportion of erroneous sequences sampled (ranging from 0 to 0.5) in the total sample. Sample size varies from 10 to 100, 500 and 1,000 sequences. For a given error rate and a cumulative proportion of erroneous sequences in the sample of size N, the probability of observing this combination is indicated in color: green: 1 to 0.95, blue: 0.95 to 0.8, yellow: 0.8 to 0.6, orange: 0.6 to 0.5, red: 0.5 to 0.4, gray: 0.4 to 0.2 and white: below 0.2. For example, if the error rate is 0.2, the probability of observing a cumulative proportion of erroneous sequences in the sample of between 0 and 0.2 ranges between 0.4 and 0.5 (red envelope). In this case, the probability of there being 20% erroneous sequences in the sample is between 0.4 and 0.5. If we consider the same error rate (0.2) with 40% erroneous sequences, then the probability ranges from 0.8 to 0.95 (blue envelope). If N increases, the variance of the probability envelopes decreases.
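The idea behind panel 1a can be illustrated with a simple binomial calculation. The sketch below is not the authors' procedure: it assumes that errors at a given position are independent between reads, that the position is corrected by simple majority vote, and that only odd read counts are considered (to avoid ties). The function names are hypothetical.

```python
from math import comb

def prob_majority_wrong(n_reads, error_rate):
    """P(more than half of n_reads are erroneous at a position).

    Assumption: each read is erroneous independently with the same
    probability, i.e. the count of erroneous reads is binomial.
    """
    k_min = n_reads // 2 + 1
    return sum(
        comb(n_reads, k) * error_rate**k * (1 - error_rate) ** (n_reads - k)
        for k in range(k_min, n_reads + 1)
    )

def min_reads_for_majority(error_rate, alpha, max_reads=999):
    """Smallest odd read count for which a wrong majority has probability <= alpha.

    max_reads is capped so that the binomial coefficients still fit in a float.
    Returns None if no read count up to max_reads is sufficient.
    """
    if not 0 <= error_rate < 0.5:
        raise ValueError("error_rate must be in [0, 0.5)")
    for n in range(1, max_reads + 1, 2):
        if prob_majority_wrong(n, error_rate) <= alpha:
            return n
    return None

# The three probability thresholds mentioned in the caption.
for rate in (0.01, 0.1, 0.2, 0.3):
    needed = [min_reads_for_majority(rate, a) for a in (0.05, 0.01, 0.001)]
    print(rate, needed)
```

Under these assumptions the required depth grows quickly as the positional error rate approaches 0.5, which is the regime the caption describes for the most error-prone positions.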

Distribution of errors along sequences

**Distribution of errors along sequences**. The blue line indicates the proportion of generated sequences (y-axis) as a function of sequence position (x-axis), based on data obtained from the analysis of reference sequence #3. The error rate for each type of error (insertions, deletions, mismatches and ambiguous base calls) is presented as a function of sequence position (x-axis) and specific position on the y-axis. The position and length of homopolymers for each base are given on the x-axis to facilitate interpretation (green: A, red: T, black: G, blue: C). See additional file

This pattern is particularly problematic for 454 data, as the number of sequences significantly decreases after 300 bases (see Figure

**Distribution of errors along the reference sequences**. The blue line represents the proportion of sequences generated (y-axis) according to the sequence position (x-axis), using data obtained from the analysis of five reference sequences (excluding reference #3, which is displayed in Figure 1). The error rate for each type of error (insertions, deletions, mismatches and ambiguous base calls) is presented as a function of the sequence position (x-axis) and specific position on the y-axis. The position and length of homopolymers for each base are given on the x-axis to facilitate interpretation (green: A, red: T, black: G, blue: C).

This issue is further complicated by the heterogeneous distribution of the error types among the six different control DNA reference sequences, within and between gasket regions for a PT GS-FLX Titanium plate and also between PT plates, as initially estimated from the large standard errors (derived from table

Interactions between variables and error characterization

The evolution of 454 technology combines progress in chemistry, acquisition devices, such as CCD cameras and PT plate handling equipment, and improvements in quality filters and base-calling algorithms. All these modifications are potential sources of variation in the amount, length and quality of sequences. In this work, we analyzed the interaction of seven variables identified as potential sources of sequencing error. We characterized sequencing error as a function of information about position in the sequence (^{2 }= 2613.3, df = 2, P < 2 × 10^{-16}). The significant result obtained in this test is mostly due to the high power of detection associated with the large number of samples available, but this heterogeneity requires the specification of individual parameter values for the logistic model describing each PT plate. The three runs were therefore analyzed separately. This approach did not prevent us from extracting the common trends influencing error rate and distribution. The models (for each plate and for each type of error) explained between 14.32% and 37.38% of the error distribution and were highly significant (P < 2 × 10^{-16}).

The nullity of r (Bravais-Pearson correlation coefficient) between pairs of the seven variables was tested independently for each run. As the usual assumptions required to infer the distribution of the test statistics were not met, we used permutations to approximate the distribution of the test statistic under H_{0}. We used a type I error rate of 0.05 and Benjamini-Hochberg correction
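The testing strategy described here (a permutation approximation of the null distribution of r, followed by Benjamini-Hochberg correction across the pairs of variables) can be sketched as follows. This is an illustrative implementation, not the authors' R code: the number of permutations, the two-sided statistic |r|, the seed and all function names are assumptions.

```python
import math
import random

def pearson_r(x, y):
    """Bravais-Pearson correlation coefficient (undefined if a variable is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_pvalue(x, y, n_perm=999, seed=0):
    """Two-sided permutation p-value for H0: r = 0, obtained by shuffling y."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        # Small tolerance so that floating-point noise does not hide ties.
        if abs(pearson_r(x, y_perm)) >= observed - 1e-12:
            hits += 1
    # The observed arrangement is counted as one of the permutations.
    return (hits + 1) / (n_perm + 1)

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where H0 is rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected
```

With seven variables there are 21 pairs per run, so `benjamini_hochberg` would be applied to the 21 permutation p-values obtained for each run.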

The nature and significance of a correlation between two variables does not provide any information about the ability of this combination of variables to explain a third variable

Decomposition of error rate variation, using all available variables

**Decomposition of error rate variation, using all available variables**. For each plate, we used a logistic model to decipher the role of each selected variable and its contribution to error rate (see materials and methods). The error rate has been broken down as a function of error type: a) insertions, b) deletions, c) mismatches and d) ambiguous base calls. We tested the deviance from the complete model by breaking down the complete model into the sum of three terms: the first exclusive to the single effect of the variable considered (in black), the second exclusive effect of the rest of the variables without the variable of interest (in gray) and the last expressing the sum of the effects of interactions between the variable considered and the other variables (in white). The contribution of each term (the proportion) for a considered variable can be viewed on the y-axis. We display only the results for plate #1 (the results for the other plates are presented in additional file

**Breakdown of error rate variation using all available variables**. For each plate, we used a logistic model to decipher the role of each selected variable in explaining the variation of error rate (see materials and methods). The figure is broken down by error type: a) insertions, b) deletions, c) mismatches and d) ambiguous base calls. We tested the deviance from the complete model by breaking down the model into the sum of three terms: the first exclusive to the single effect of the variable considered (in black), the second exclusive effect of the rest of the variables without the variable of interest (in gray) and the last expressing the sum of the effects of interactions between the variable considered and the other variables (in white). The contribution of each term (the proportion) for a considered variable can be viewed on the y-axis. Additional file

At DNA sequence level, we detailed the variables individually accounting for the highest proportion of the error rate for each error type. It was essential to bear in mind, during this analysis, the fact that most of the explanatory power of these variables was obtained with combinations of variables. We analyzed each type of error independently.

For insertion errors (Figure

For deletion errors,

Finally, mismatch and ambiguous base call error rates were both found to be linked to

Given this pattern, the next step in the integration of information is characterizing the effect of bead localization on error rate. In particular, it is useful to consider whether position in a particular region or on the PT plate is linked to error rate. Heterogeneity in error rate as a function of bead location was found for insertions and deletions, whatever the PT plate analyzed. Heterogeneity was observed at both the region and plate scales. More precisely, error rate variation was mostly accounted for by the combination of several variables but, when the distribution of insertion errors fitted a gradient following the Y-axis in each region (Figure

Spatial distribution of error rate variation

**Spatial distribution of error rate variation**. For each error type and sequence length, the x-axis represents the spatial location of 454 reads and the y-axis represents the y-coordinates on the PT plate. The results presented in this figure correspond to plate #1. Data for the other two runs is presented in additional file

**Spatial localization of error rate variation**. For each error type and for sequence length, the x-axis represents the spatial localization of 454 reads as x-coordinates and the y-axis represents the y-coordinates on the PT plate. The results presented in additional file 4 correspond to plates #2 and #3. The strips represent the regions. We display separately the four types of error (insertions, deletions, mismatches and ambiguous base calls) and the length of the generated sequences. Colors represent the ranges of error rates from 0 to 1 (or the length of the sequences from 0 to 500), using a sliding window (see materials and methods).

Conclusions

From statistical inference to technical causes and perspectives

As detailed in the results and discussion section, error rate variability is mostly accounted for by the combination of the seven variables analyzed. However, the heterogeneous physical pattern may be partially driven by the combined influence of the central CCD camera (edge effect) with chemical flow direction (Y-axis). This explanation is, however, insufficient in itself to account for the observed pattern, and other variables clearly influence error rate. The negative relationship between insertion and deletion errors is probably related to physical acquisition issues, but chemistry-related artifacts probably also have an effect (through the related statistical variables analyzed), including the CAFIE effect (carry forward and incomplete extension) in particular. Carry forward occurs when a trace amount of nucleotide remains in a well after the apyrase wash, perpetuating premature nucleotide incorporations for specific sequence combinations during the next base flow and contributing to signal 'noise'. Incomplete extension occurs when some DNA strands on a bead fail to incorporate during the appropriate base flow. The strands that fail to incorporate must await another flow cycle for sequencing to continue and are thus incorporated out-of-phase with the rest of the strands

This study clearly demonstrates that sequencing error rate, as deciphered here, is a heterogeneous feature in 454 GS-FLX Titanium pyrosequencing. We cannot extrapolate the results obtained for other technologies, such as the GS20 system, to this system, nor is the use of a single global error rate appropriate. Our results provide information about the number of sequences required to correct for a specific erroneous position, when detected, but this procedure requires the error rate to be computed from within the 454 PT plate regions in which the physical distribution of error rate is heterogeneous. Internal DNA controls should therefore be used when appropriate

Methods

Experimental design and reference sequences

We used the six control DNA fragment Type I sequences (as provided in Roche 454 protocols) as reference sequences. This made it possible to use a large number of strictly identical templates to characterize the sequencing error rate of this technology. The sequences generated constituted a set of three replicates from three different runs, making it possible to assess the quality and accuracy of the 454 GS-FLX Titanium method. Six references were used, with lengths ranging from 500 to 592 bp and GC contents from 52.75% to 65.85%; each of these reference sequences contained a large number of homopolymers (20 to 34), defined as a succession of three or more identical bases. Homopolymer positions are shown on Figure

**FASTA file of the 6 reference sequences**. The six reference DNA sequences used in this analysis are found in the corresponding FASTA file. They correspond to the control DNA fragments of type I provided with 454 GS-FLX Titanium sequencing kits. As such, the polymorphism displayed by the sequences corresponds purely to sequencing errors.

All reference sequence positions were classified according to the presence and length of a homopolymer. The first and last bases of a homopolymer, and those within two bases on either side of a homopolymer, were coded "1". All the other positions within the homopolymer were coded "3" to "6" (the length of the homopolymer). All positions outside these zones (not influenced by the homopolymer) were coded "0".
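The homopolymer coding scheme can be made concrete with a short sketch. It follows one reading of the rules in the text and makes several assumptions that the text does not settle: where the zones of two neighbouring homopolymers overlap, interior codes take precedence over the flanking code "1"; homopolymers longer than six bases are simply coded by their length; and the function names are hypothetical.

```python
def homopolymer_runs(seq, min_len=3):
    """Return (start, end) pairs, end exclusive, for runs of >= min_len identical bases."""
    runs = []
    start = 0
    for i in range(1, len(seq) + 1):
        if i == len(seq) or seq[i] != seq[start]:
            if i - start >= min_len:
                runs.append((start, i))
            start = i
    return runs

def code_positions(seq, flank=2, min_len=3):
    """Code each position: 0 = unaffected, 1 = run boundary or flank, run length = interior."""
    codes = [0] * len(seq)
    runs = homopolymer_runs(seq, min_len)
    # First pass: the run itself plus `flank` bases on either side get code 1.
    for start, end in runs:
        for i in range(max(0, start - flank), min(len(seq), end + flank)):
            codes[i] = 1
    # Second pass: interior bases (all but the first and last) get the run length.
    # Assumption: interior codes override a flank code from a neighbouring run.
    for start, end in runs:
        for i in range(start + 1, end - 1):
            codes[i] = end - start
    return codes

print(code_positions("ACGTTTTACG"))
```

For a run of exactly three bases, only the middle base receives the code "3"; the two outer bases count as the first and last bases of the homopolymer and are coded "1".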

The dataset consisted of 86,237 sequences, corresponding to 29,100,738 positions. Sequencing was carried out at Genoscreen, France. We aimed to identify factors linked to error rate. For a tractable analysis, we analyzed a dataset corresponding to all the positions at which an error was detected, plus a similar number of error-free positions randomly selected from the whole original dataset.
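The subsampling step (all erroneous positions plus a similar number of randomly selected error-free positions) might look like the sketch below. It assumes an exactly equal number of error-free positions and sampling without replacement; the text only says "a similar number", and the function name and seed are illustrative.

```python
import random

def balanced_subsample(error_flags, seed=0):
    """Return sorted indices of all error positions plus as many random error-free ones.

    error_flags: sequence of 0/1 values, one per position (1 = error detected).
    Assumption: equal numbers are drawn, without replacement.
    """
    rng = random.Random(seed)
    error_idx = [i for i, flag in enumerate(error_flags) if flag]
    clean_idx = [i for i, flag in enumerate(error_flags) if not flag]
    n_clean = min(len(error_idx), len(clean_idx))
    sampled_clean = rng.sample(clean_idx, n_clean)
    return sorted(error_idx + sampled_clean)

# Toy example (invented flags, not data from the study).
flags = [1, 0, 0, 1, 0, 0, 0, 0]
print(balanced_subsample(flags))
```

Because error-free positions are undersampled, the intercept of a model fitted to such a subset no longer reflects the overall error rate, whereas the effects of the explanatory variables remain interpretable.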

Sequencing error analysis

Reads (see additional file

**Raw data sequences from 454 GS-FLX Titanium sequencing**. This file contains three archives, including the raw FASTA files for each sequencing run.

In the analyses, the observation unit was the position on the 454-generated sequences. These positions were transformed into the position on the reference sequence. Insertions are reported with respect to the position of the base preceding the gaps. For each position, a binary variable was defined indicating the presence or absence of a sequencing error. An error is defined here as discordance between two homologous positions: the first in the reference sequence and the second in the generated sequence. Discordance may refer to an insertion, a deletion, a nucleotide mismatch or an ambiguous base call (N) with respect to a non-available nucleotide determination on the replicate sequence (according to Huse ** Position**, position in the sequence expressed as a proportion of the total length of the reference sequence (treated as a quantitative variable); (ii)

The R package was used for all statistical tests

Let us define π_{i} = π(x^{*}_{i}) = P(Y_{i} = 1 | X^{*}_{i} = x^{*}_{i}). Y_{i} is the binary variable equal to 1 if an error is present and 0 otherwise. x^{*}_{i} is the vector (x_{1i}, x_{2i}, ..., x_{7i}) of the explanatory variables. We chose to model the error rate π(x^{*}_{i}) with a logistic model

Maximum likelihood estimators were considered to estimate the parameters of the model. Tests of significance of the parameters were then carried out with Student's t test. A model was generated for each of the three plates and for each of the error types (insertion, deletion, mismatch and N). All the analyses were performed with R (version 2.6.0).
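To illustrate what fitting such a model involves, the sketch below estimates a logistic model by maximum likelihood using plain gradient ascent. It assumes the standard logistic form, in which the log-odds of an error are a linear function of the explanatory variables; the analyses in the paper were run in R, so this Python code, its function names, its optimizer settings and the toy data are all illustrative assumptions rather than the authors' implementation.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(beta, x):
    """Modelled error probability for one vector of explanatory variables."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))

def fit_logistic(X, y, lr=1.0, n_iter=2000):
    """Maximum likelihood fit by gradient ascent on the log-likelihood.

    X: list of rows of explanatory variables (without intercept column).
    y: list of 0/1 outcomes (1 = error at that position).
    Returns [intercept, coefficient_1, ..., coefficient_p].
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            residual = yi - predict(beta, xi)
            grad[0] += residual
            for j, v in enumerate(xi, start=1):
                grad[j] += residual * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Toy data (invented): one variable, relative position in the read,
# with errors becoming more frequent towards the end.
X = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8], [0.9]]
y = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]
beta = fit_logistic(X, y)
print("intercept and slope:", beta)
print("P(error) at relative position 0.9:", predict(beta, [0.9]))
```

In the study, one such model would be fitted per plate and per error type, with all seven explanatory variables in place of the single toy variable used here.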

The contribution of a given explanatory variable

Authors' contributions

AG conceived the study and wrote the manuscript. EM participated in the design of the study, performed the bioinformatics analysis and helped to write the manuscript. NP participated in the design of the study, performed the statistical analysis and helped to write the manuscript. SF participated in the design and performed the molecular biology. TM helped to write the manuscript. JFM conceived the study and wrote the manuscript. All authors have read and approved the final manuscript.

Acknowledgements

We thank G. Nève for assistance with figure design. We thank M. Galan for useful comments on previous versions of the manuscript and S. Nielsen and J. Sappa (Alex Edelman) for major improvements of English grammar throughout the text. This work was supported by the AIP BioRessources "EcoMicro" grant from the French