Open Access Highly Accessed Research article

Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing

André Gilles1, Emese Meglécz1, Nicolas Pech1, Stéphanie Ferreira2, Thibaut Malausa3 and Jean-François Martin4*

Author Affiliations

1 Aix-Marseille Université, CNRS, IRD, UMR 6116 - IMEP, Equipe Evolution Génome Environnement, Centre Saint-Charles, Case 36, 3 place Victor Hugo, 13331 Marseille Cedex 3, France

2 Genoscreen, Genomic Platform and R&D, Campus de l'Institut Pasteur, 1 rue du Professeur Calmette, Bâtiment Guérin, 4ème étage, 59000 Lille, France

3 Institut National de la Recherche Agronomique, UMR 1301, Equipe BPI, 400 route des Chappes, BP 167, 06903 Sophia-Antipolis Cedex, France

4 UMR CBGP (INRA/IRD/Cirad/Montpellier SupAgro), Campus international de Baillarguet, CS 30016, F-34988 Montferrier-sur-Lez cedex, France

BMC Genomics 2011, 12:245  doi:10.1186/1471-2164-12-245

Published: 19 May 2011

Additional files

Additional file 1:

Number of sequences to correct erroneous positions. 1a: the number of sequences necessary to obtain a majority of correct sequences. The x-axis shows the error rate and the y-axis shows the number of sequences needed, according to three possible probabilities: 0.001, 0.01 and 0.05. 1b: the x-axis shows the error rate for a given position (ranging from 0 to 0.5); the y-axis shows the cumulative proportion of erroneous sequences sampled (ranging from 0 to 0.5) in the total sample. Sample sizes are 10, 100, 500 and 1,000 sequences. For a given error rate and a cumulative proportion of erroneous sequences in a sample of size N, the probability of observing this combination is indicated in color: green: 1 to 0.95, blue: 0.95 to 0.8, yellow: 0.8 to 0.6, orange: 0.6 to 0.5, red: 0.5 to 0.4, gray: 0.4 to 0.2 and white: below 0.2. For example, if the error rate is 0.2, the probability of observing a cumulative proportion of erroneous sequences in the sample of between 0 and 0.2 lies between 0.4 and 0.5 (red envelope). In other words, the probability of there being up to 20% erroneous sequences in the sample is between 0.4 and 0.5. If we consider the same error rate (0.2) with up to 40% erroneous sequences, then the probability ranges from 0.8 to 0.95 (blue envelope). As N increases, the variance of the probability envelopes decreases.
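The idea behind panel 1a can be illustrated with a minimal sketch. It assumes that errors at a position are independent between reads and follow a binomial distribution, and that the three probabilities act as thresholds on the chance of failing to obtain a majority of correct reads; this is an assumption made for illustration, not necessarily the calculation used to produce the figure. The function names and the `max_n` cap are hypothetical.

```python
from math import comb


def prob_no_correct_majority(n, error_rate):
    """P(at least half of n reads are erroneous) under a binomial model.

    Assumption: reads err independently with the same per-position rate.
    """
    threshold = (n + 1) // 2  # smallest k such that k >= n / 2
    return sum(
        comb(n, k) * error_rate**k * (1 - error_rate) ** (n - k)
        for k in range(threshold, n + 1)
    )


def min_reads_for_majority(error_rate, alpha, max_n=999):
    """Smallest odd n for which P(no correct majority) <= alpha.

    Only odd n are tried so that ties cannot occur. Returns None if no
    n up to max_n satisfies the threshold. max_n is kept below 1000 so
    that the binomial coefficients still fit in a float.
    """
    if not 0 <= error_rate < 0.5:
        raise ValueError("error_rate must be in [0, 0.5)")
    for n in range(1, max_n + 1, 2):
        if prob_no_correct_majority(n, error_rate) <= alpha:
            return n
    return None


if __name__ == "__main__":
    # Thresholds taken from the caption; the error rates are illustrative.
    for rate in (0.05, 0.1, 0.2, 0.3):
        needed = [min_reads_for_majority(rate, a) for a in (0.05, 0.01, 0.001)]
        print(rate, needed)
```

Under these assumptions, the number of reads needed grows quickly as the error rate approaches 0.5, which is the qualitative behavior the caption describes.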

Format: PDF. Size: 5.3 MB.

This file can be viewed with: Adobe Acrobat Reader

Open Data

Additional file 2:

Distribution of errors along the reference sequences. The blue line represents the proportion of sequences generated (y-axis) according to the sequence position (x-axis), using data obtained from the analysis of 5 reference sequences (excluding reference #3, which is displayed in Figure 1). The error rate for each type of error (insertions, deletions, mismatches and ambiguous base calls) is presented as a function of the sequence position (x-axis), with each error type at a specific position on the y-axis. The position and length of homopolymers for each base are given on the x-axis to facilitate interpretation (green: A, red: T, black: G, blue: C).
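The homopolymer annotation described in this caption can be reproduced for any sequence with a short sketch. The function name, the 1-based coordinates and the minimum run length of 2 are assumptions made for illustration; the caption does not state which threshold was used for the figure.

```python
from itertools import groupby


def homopolymer_runs(sequence, min_length=2):
    """Return (start, base, length) for each run of identical bases.

    start is 1-based. min_length=2 is an illustrative assumption,
    not a value taken from the article.
    """
    runs = []
    position = 1
    for base, group in groupby(sequence.upper()):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((position, base, length))
        position += length
    return runs


if __name__ == "__main__":
    # Toy sequence, not one of the reference sequences.
    for start, base, length in homopolymer_runs("ACGGGTTA"):
        print(start, base, length)
```

Overlaying these runs on per-position error rates makes it easier to see whether insertions and deletions cluster at homopolymers.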

Format: PDF. Size: 265 KB.

Additional file 3:

Breakdown of error rate variation using all available variables. For each plate, we used a logistic model to decipher the role of each selected variable in explaining the variation in error rate (see Materials and Methods). The figure is broken down by error type: a) insertions, b) deletions, c) mismatches and d) ambiguous base calls. We tested the deviance from the complete model by breaking the model down into the sum of three terms: the first is the exclusive effect of the variable considered (in black), the second is the exclusive effect of the remaining variables without the variable of interest (in gray) and the third is the sum of the effects of interactions between the variable considered and the other variables (in white). The contribution of each term (as a proportion) for a given variable is shown on the y-axis. Additional file 3 displays the results for plates #2 and #3 (results for plate #1 are presented in Figure 2).

Format: PDF. Size: 117 KB.

Additional file 4:

Spatial localization of error rate variation. For each error type and for the sequence length, the x-axis and y-axis represent the x-coordinates and y-coordinates of the 454 reads on the PT plate. The results presented in Additional file 4 correspond to plates #2 and #3. The strips represent the regions. The four types of error (insertions, deletions, mismatches and ambiguous base calls) and the length of the generated sequences are displayed separately. Colors represent the ranges of error rates from 0 to 1 (or the length of the sequences from 0 to 500), using a sliding window (see Materials and Methods).

Format: PDF. Size: 550 KB.

Additional file 5:

FASTA file of the 6 reference sequences. The six reference DNA sequences used in this analysis are found in the corresponding FASTA file. They correspond to the control DNA fragments of type I provided with 454 GS-FLX Titanium sequencing kits. As such, the polymorphism displayed by the sequences corresponds purely to sequencing errors.

Format: FAS. Size: 3 KB.

Additional file 6:

Raw data sequences from 454 GS-FLX Titanium sequencing. This file contains three archives, including the raw FASTA files for each sequencing run.

Format: RAR. Size: 2.3 MB.
