Open Access Highly Accessed Correspondence

Addressing challenges in the production and analysis of Illumina sequencing data

Martin Kircher1, Patricia Heyn2 and Janet Kelso1*

Author Affiliations

1 Max Planck Institute for Evolutionary Anthropology, Department of Evolutionary Genetics, Deutscher Platz 6, 04103 Leipzig, Germany

2 Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstrasse 108, 01307 Dresden, Germany

BMC Genomics 2011, 12:382  doi:10.1186/1471-2164-12-382

Published: 29 July 2011

Additional files

Additional File 1:

Merging of paired-end reads efficiently removes adapter sequence for short-insert libraries and increases read accuracy. Shown is the average sequencing error of the two simulated raw reads (black) in comparison to the sequencing error remaining after read merging for different adapter start points. The error after merging is shown for two different types of simulated quality scores (red and green). In red, the quality score is the average error observed for the specific base type in this cycle (i.e. all adenines at this position in the read have the same quality score), while in green an error-informative quality score was simulated. For this type of quality score, a random number between 0 and 10 (uniform sampling) was added to the average quality score of the base when the correct base was simulated, and a random number between 0 and 10 (uniform sampling) was subtracted when a wrong base was simulated. On average, the error (starting from 0.244%) is reduced 1.93-fold (to 0.126%) for the position-dependent quality scores and 4.98-fold (to 0.049%) for the error-informative quality scores. For sequences shorter than or equal to the read length (5-101nt), the error (starting from 0.146%) is reduced 1.62-fold (to 0.090%) and 20.88-fold (to 0.007%), respectively. Sequences are required to overlap by more than 10nt for merging, and merged sequences below 5nt are discarded as adapter dimers by the program.
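The two simulated quality-score types described in this legend can be sketched as follows. This is only an illustration, not the simulation code used for the figure: the function names are hypothetical, the offset is assumed to be a continuous uniform draw, and the position-dependent score is assumed to be the PHRED-scaled average error.

```python
import math
import random


def phred(error_probability):
    """Convert an error probability to a PHRED-scaled quality score."""
    return -10.0 * math.log10(error_probability)


def error_informative_quality(average_quality, base_is_correct, rng):
    """Shift the position-dependent quality by a uniform offset in [0, 10].

    The offset is added for a correctly simulated base and subtracted
    for a wrongly simulated base (assumption: continuous uniform draw).
    """
    offset = rng.uniform(0.0, 10.0)
    if base_is_correct:
        return average_quality + offset
    return average_quality - offset


# Hypothetical example: an average error of 0.1% for this base type and cycle.
rng = random.Random(42)
position_quality = phred(0.001)
correct_scores = [error_informative_quality(position_quality, True, rng)
                  for _ in range(1000)]
wrong_scores = [error_informative_quality(position_quality, False, rng)
                for _ in range(1000)]
```

With this scheme correct bases never score below the position-dependent value and wrong bases never score above it, which is what makes the score informative about errors.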

Format: PDF Size: 33KB Download file

This file can be viewed with: Adobe Acrobat Reader

Open Data

Additional File 2:

Reduction of the number of clusters identified in tile images due to identical tag sequences. If all or the vast majority of sequences start identically in the first read of a sequencing run, image analysis will consider a higher fraction of the clusters to have grown into each other and will remove them. This effect is observed, for example, if libraries are made from restriction-digested molecules or if tag/barcode sequences are added at the outer molecule ends and read in the first read. Changing parameters for an offline image analysis (Firecrest module) can be used as a workaround. The figure shows cluster counts as well as a section of the image of the same tile in cycles 1 and 4 for a run from the Neandertal Genome project (Green et al: Science 2010), 080902_BIOLAB29_Run_PE51_1, in which the tag 'GAC' was read at the beginning of the first read. Cluster counts were obtained from the IPAR v1.01 image analysis (cluster identification based only on the first cycle of the run) and from a version of the Firecrest v1.9.5 algorithm in which cluster identification was done in cycle 4.

Format: PDF Size: 67KB Download file

Additional File 3:

Quality score distribution of artifact reads largely overlaps with the quality score distribution of regular reads. Sequences resulting from crystals, dust and lint particles as well as other flow cell features are typically of low complexity (Additional File 2) but only partially of low quality. Plotted is the quality score frequency distribution (PHRED scale, Ibis base caller) for all reads matching the 'GAC' library tag at the beginning of the read (black, n = 557,466,159 bases from 10,930,709 reads) as well as for all sequences matching neither the tag sequence nor any of its one-base substitutions (red, n = 3,481,668 bases from 68,268 reads). The data were obtained from lane 5 of the 080902_BIOLAB29_Run_PE51_1 run from the Neandertal Genome project (Green et al: Science 2010).
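The split into tag-matching reads and artifact candidates can be sketched as below. This is a minimal illustration under stated assumptions, not the filter used for the figure: the function name and the three class labels are hypothetical, and reads whose start differs from the tag by exactly one substitution are placed in a separate class because the legend excludes them from the artifact set.

```python
def classify_by_tag(read, tag="GAC"):
    """Classify a read by how well its first bases match the library tag.

    Returns "tag" for an exact match, "near_tag" for exactly one
    substitution, and "other" for everything else (artifact candidates).
    """
    prefix = read[:len(tag)]
    if len(prefix) < len(tag):
        return "other"  # read too short to carry the tag
    mismatches = sum(1 for a, b in zip(prefix, tag) if a != b)
    if mismatches == 0:
        return "tag"
    if mismatches == 1:
        return "near_tag"
    return "other"


# Hypothetical reads, for illustration only.
labels = [classify_by_tag(r) for r in ("GACTTAGC", "GATTTAGC", "AAAAAAAA")]
```

Quality scores of the "tag" and "other" classes could then be tallied separately to compare their distributions.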

Format: PDF Size: 40KB Download file

Additional File 4:

Non-random distribution of sequencing error across sequencing clusters. Random cluster generation results in a wide range of inter-cluster distances, causing sequencing error to be non-randomly distributed across clusters. As a consequence, the fraction of reads with two errors is not equal to the squared fraction of reads with one error. Shown are the observed rates of reads with 1 to 5 errors (solid lines) for different Illumina Genome Analyzer data sets that were presented as test data sets for the Ibis base caller (Kircher et al: Genome Biology 2009), together with the rates expected when extrapolating from the fraction of molecules with one error (dashed line).
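One way to read the expectation in this legend is that, if errors were spread evenly across clusters, the fraction of reads with k errors would be the k-th power of the fraction with one error. The sketch below encodes that reading; the exact extrapolation behind the dashed line is an assumption here, the function names are hypothetical, and the observed values are made-up placeholders rather than data from the paper.

```python
def expected_error_fractions(fraction_one_error, max_errors=5):
    """Expected fraction of reads with k errors, assuming p_k = p_1 ** k."""
    return {k: fraction_one_error ** k for k in range(1, max_errors + 1)}


def excess_over_expected(observed, fraction_one_error):
    """Ratio of observed to expected fraction for each error count."""
    expected = expected_error_fractions(fraction_one_error, max(observed))
    return {k: observed[k] / expected[k] for k in observed}


# Placeholder numbers for illustration only (not from the paper).
observed_fractions = {1: 0.05, 2: 0.01, 3: 0.004}
excess = excess_over_expected(observed_fractions, observed_fractions[1])
```

Ratios well above 1 for two or more errors would indicate that errors concentrate in a subset of clusters rather than being spread evenly.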

Format: PDF Size: 30KB Download file

Additional File 5:

Reduction in sequencing error when using the Ibis base caller for different instrument chemistries. Alternative base callers significantly reduce the error rate and thereby increase the output of usable reads. The Ibis base caller (Kircher et al. Genome Biology 2009) supports a wide range of instrument and software versions, as well as single-read, paired-end and multiplex sequencing runs. It trains sequencing-cycle-specific machine learning models on a training data set, for example a φX174 spike-in control. Quality scores are also adjusted for each run based on these data and are therefore comparable between sequencing runs and libraries without further normalization.

Format: PDF Size: 53KB Download file
