Figure 5.

Image artifacts can generate false sequences. Cluster identification can misinterpret crystals, dust and lint particles, and other flow cell features as sequence clusters (A). Indicated are 103 non-library sequences originating from a lint particle observed in a library that was sequenced with a three-base-pair tag ('GAC') at the beginning of each read. In this case, non-library sequences could therefore be distinguished by these first three bases. The fraction of such artifact clusters increases in runs with low loading density and low intensity. A sequence entropy filter removes the majority of these sequences (82.52% at a cutoff of 0.85), but also removes non-artifact sequences (B): as indicated in the figure, 0.01% of the human reference genome (GRCh37/hg19). For 3'/5' tagged libraries or indexed sequencing libraries, filtering for the index/tag is therefore superior to base composition/sequence entropy filters for removing such sequencing artifacts.
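The two filters compared in this caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "sequence entropy" means the Shannon entropy of a read's base composition, normalized to the range 0 to 1 by using logarithm base 4, which may differ from the definition used to produce the figure. The function names `base_entropy`, `passes_entropy_filter` and `passes_tag_filter` are hypothetical; only the 'GAC' tag and the 0.85 cutoff are taken from the caption.

```python
import math
from collections import Counter


def base_entropy(seq):
    """Shannon entropy of the A/C/G/T composition of seq.

    Assumption: entropy is computed with log base 4, so a read with
    equal base frequencies scores 1.0 and a homopolymer scores 0.0.
    Characters other than A, C, G, T (e.g. N) are ignored.
    """
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total, 4) for n in counts.values())


def passes_entropy_filter(seq, cutoff=0.85):
    """Keep reads whose normalized base entropy reaches the cutoff."""
    return base_entropy(seq) >= cutoff


def passes_tag_filter(seq, tag="GAC"):
    """Keep reads that start with the expected library tag."""
    return seq.upper().startswith(tag)


if __name__ == "__main__":
    reads = [
        "GACATCGGATCTAGCTAGGCTA",  # tagged, mixed composition
        "GACAAAAAAAAAAAAAAAAAAA",  # tagged, low complexity
        "AAAAAAAAAAAAAAAAAAAAAA",  # untagged homopolymer, artifact-like
    ]
    for read in reads:
        print(read, round(base_entropy(read), 2),
              passes_entropy_filter(read), passes_tag_filter(read))
```

In this sketch the second read carries the tag but has a low-complexity composition, so the entropy filter rejects it while the tag filter keeps it, which illustrates why the caption favors tag or index filtering when a tag is available.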

Kircher et al. BMC Genomics 2011 12:382 doi:10.1186/1471-2164-12-382