# Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing

- † Equal contributors

^{1} Aix-Marseille Université, CNRS, IRD, UMR 6116 - IMEP, Equipe Evolution Génome Environnement, Centre Saint-Charles, Case 36, 3 place Victor Hugo, 13331 Marseille Cedex 3, France

^{2} Genoscreen, Genomic Platform and R&D, Campus de l'Institut Pasteur, 1 rue du Professeur Calmette, Bâtiment Guérin, 4ème étage, 59000 Lille, France

^{3} Institut National de la Recherche Agronomique, UMR 1301, Equipe BPI, 400 route des Chappes, BP 167, 06903 Sophia-Antipolis Cedex, France

^{4} UMR CBGP (INRA/IRD/Cirad/Montpellier SupAgro), Campus international de Baillarguet, CS 30016, F-34988 Montferrier-sur-Lez cedex, France

*BMC Genomics* 2011, **12**:245
doi:10.1186/1471-2164-12-245

### Additional files

**Additional file 1:**

**Number of sequences to correct erroneous positions**. 1a: the number of sequences needed to obtain a majority of correct sequences. The x-axis shows the error rate and the y-axis shows the number of sequences needed, for three possible probabilities: 0.001, 0.01 and 0.05. 1b: the x-axis shows the error rate for a given position (ranging from 0 to 0.5); the y-axis shows the cumulative proportion of erroneous sequences sampled (ranging from 0 to 0.5) in the total sample. Four sample sizes are shown: 10, 100, 500 and 1,000 sequences. For a given error rate and a given cumulative proportion of erroneous sequences in a sample of size N, the probability of observing this combination is indicated by color: green: 1 to 0.95, blue: 0.95 to 0.8, yellow: 0.8 to 0.6, orange: 0.6 to 0.5, red: 0.5 to 0.4, gray: 0.4 to 0.2 and white: below 0.2. For example, if the error rate is 0.2, the probability of observing a cumulative proportion of erroneous sequences in the sample of between 0 and 0.2 ranges between 0.4 and 0.5 (red envelope). In this case, the probability of there being 20% erroneous sequences in the sample is between 0.4 and 0.5. If we consider the same error rate (0.2) with 40% erroneous sequences, then the probability ranges from 0.8 to 0.95 (blue envelope). As N increases, the variance of the probability envelopes decreases.

Format: PDF. Size: 5.3 MB.

This file can be viewed with: Adobe Acrobat Reader
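
As a rough illustration of the kind of calculation summarized in Additional file 1a, the sketch below assumes that reads err independently at a given position with a fixed error rate, so that the number of erroneous reads follows a binomial distribution. The binomial model, the function names and the reading of the three probabilities (0.001, 0.01 and 0.05) as the tolerated risk that erroneous reads make up at least half of the sample are assumptions made here for illustration, not a description of the authors' procedure.

```python
from math import comb


def prob_errors_reach_half(n_reads, error_rate):
    """Probability that at least half of n_reads are erroneous at a position.

    Assumes independent reads with the same per-position error rate
    (a binomial model; this is an assumption of the sketch).
    """
    k_min = (n_reads + 1) // 2  # smallest integer k with k >= n_reads / 2
    return sum(
        comb(n_reads, k) * error_rate**k * (1 - error_rate) ** (n_reads - k)
        for k in range(k_min, n_reads + 1)
    )


def min_reads_for_majority(error_rate, risk, max_reads=999):
    """Smallest odd number of reads for which the probability of an
    erroneous majority falls below `risk`.

    Only odd sample sizes are tried so that ties cannot occur.
    Returns None if no sample size up to max_reads is sufficient
    (for example when error_rate is 0.5 or higher).
    """
    for n_reads in range(1, max_reads + 1, 2):
        if prob_errors_reach_half(n_reads, error_rate) < risk:
            return n_reads
    return None


if __name__ == "__main__":
    # Hypothetical error rates, evaluated at the three risk levels.
    for rate in (0.01, 0.1, 0.2, 0.3):
        needed = [min_reads_for_majority(rate, risk) for risk in (0.05, 0.01, 0.001)]
        print(rate, needed)
```

Under these assumptions the required number of reads grows quickly as the error rate approaches 0.5, where no finite sample guarantees a correct majority.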

**Additional file 2:**

**Distribution of errors along the reference sequences**. The blue line represents the proportion of sequences generated (y-axis) according to the sequence position (x-axis), using data obtained from the analysis of five reference sequences (excluding reference #3, which is displayed in Figure 1). The error rate for each type of error (insertions, deletions, mismatches and ambiguous base calls) is presented as a function of the sequence position (x-axis) and specific position on the y-axis. The position and length of homopolymers for each base are given on the x-axis to facilitate interpretation (green: A, red: T, black: G, blue: C).

Format: PDF. Size: 265 KB.
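
The homopolymer positions and lengths marked along the x-axis of Additional file 2 can be recovered from any reference sequence with a few lines of code. The following is a minimal sketch; the function name, the 1-based coordinates, the minimum run length of 3 and the toy sequence are all assumptions chosen for illustration, and the source does not state which threshold was used for the figure.

```python
from itertools import groupby


def homopolymer_runs(sequence, min_length=3):
    """Return (start, base, length) for each homopolymer run in a sequence.

    `start` is 1-based. `min_length` is an illustrative threshold,
    not a value taken from the article.
    """
    runs = []
    position = 1
    for base, group in groupby(sequence.upper()):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((position, base, length))
        position += length
    return runs


if __name__ == "__main__":
    # Toy sequence, not one of the reference sequences.
    print(homopolymer_runs("ACGGGTTTTAC"))
```

For the toy sequence the function reports a run of three G starting at position 3 and a run of four T starting at position 6.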

**Additional file 3:**

**Breakdown of error rate variation using all available variables**. For each plate, we used a logistic model to decipher the role of each selected variable in explaining the variation of error rate (see materials and methods). The figure is broken down by error type: a) insertions, b) deletions, c) mismatches and d) ambiguous base calls. We tested the deviance from the complete model by breaking down the model into the sum of three terms: the first is exclusive to the single effect of the variable considered (in black), the second is exclusive to the effect of the remaining variables, without the variable of interest (in gray), and the last expresses the sum of the effects of interactions between the variable considered and the other variables (in white). The contribution of each term (as a proportion) for a given variable is shown on the y-axis. Additional file 3 displays the results for plates #2 and #3 (results for plate #1 are presented in Figure 2).

Format: PDF. Size: 117 KB.

**Additional file 4:**

**Spatial localization of error rate variation**. For each error type and for the sequence length, the x-axis and y-axis represent the x- and y-coordinates of the 454 reads on the PT plate. The results presented in this additional file correspond to plates #2 and #3. The strips represent the regions. The four types of error (insertions, deletions, mismatches and ambiguous base calls) and the length of the generated sequences are displayed separately. Colors represent the ranges of error rates from 0 to 1 (or the length of the sequences from 0 to 500), computed using a sliding window (see materials and methods).

Format: PDF. Size: 550 KB.

**Additional file 5:**

**FASTA file of the 6 reference sequences**. The six reference DNA sequences used in this analysis are found in the corresponding
FASTA file. They correspond to the control DNA fragments of type I provided with 454
GS-FLX Titanium sequencing kits. As such, the polymorphism displayed by the sequences
corresponds purely to sequencing errors.

Format: FAS. Size: 3 KB.
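
A FASTA file such as Additional file 5 can be loaded with the standard library alone. The reader below is a minimal sketch: the function name and the toy records are hypothetical and are not the six reference sequences, and sequences are upper-cased as a convenience rather than because the file requires it.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA text from an open file-like object.

    Returns a dict mapping each record identifier (the first word after
    '>') to its sequence, joined across lines and upper-cased.
    """
    records = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            header_words = line[1:].split()
            name = header_words[0] if header_words else ""
            chunks = []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


if __name__ == "__main__":
    # Toy input standing in for the real file.
    toy = StringIO(">ref1 toy record\nACGT\nacgt\n\n>ref2\nGGGTTT\n")
    for identifier, sequence in read_fasta(toy).items():
        print(identifier, len(sequence))
```

With a real file, the same function can be called as `read_fasta(open(path))`, where `path` points to the downloaded FASTA file.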

**Additional file 6:**

**Raw data sequences from 454 GS-FLX Titanium sequencing**. This file contains three archives, including the raw FASTA files for each sequencing
run.

Format: RAR. Size: 2.3 MB.