Image artifacts can generate false sequences. Cluster identification can mistake crystals, dust and lint particles, as well as other flow cell features, for sequence clusters (A). Indicated are 103 non-library sequences originating from a lint particle observed in a library that was sequenced with a three base pair tag ('GAC') at the beginning of each read. In this case, non-library sequences could therefore be distinguished by these first three bases. The fraction of such artifact clusters is higher in runs with low loading density and low intensity. A sequence entropy filter removes the majority of these sequences (82.52% for a cutoff of 0.85), but it also removes non-artifact sequences (B): as indicated in the figure, 0.01% of the human reference genome (GRCh37/hg19). For 3'/5' tagged libraries or indexed sequencing libraries, filtering for the index/tag is therefore superior to base composition/sequence entropy filters for removing such sequencing artifacts.
Kircher et al. BMC Genomics 2011 12:382 doi:10.1186/1471-2164-12-382
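The contrast between the two filters can be illustrated with a minimal Python sketch. It assumes that sequence entropy means the Shannon entropy (in bits) of a read's base composition; the exact definition used for the figure may differ, so the 0.85 cutoff is carried over only as an illustrative default. The function names `has_tag`, `base_entropy` and `passes_entropy` and the example reads are hypothetical.

```python
import math
from collections import Counter


def has_tag(read: str, tag: str = "GAC") -> bool:
    """Tag filter: keep a read only if it starts with the expected tag."""
    return read.upper().startswith(tag)


def base_entropy(read: str) -> float:
    """Shannon entropy (bits) of the base composition of a read.

    Assumption: this is one common definition of sequence entropy and
    may not match the one used for the figure.
    """
    counts = Counter(read.upper())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def passes_entropy(read: str, cutoff: float = 0.85) -> bool:
    """Entropy filter: keep a read only if its entropy reaches the cutoff."""
    return base_entropy(read) >= cutoff


# Hypothetical reads, for illustration only.
reads = {
    "tagged, low complexity": "GACAAAAAAAAA",
    "untagged, low complexity": "AAAAAAAAAAAA",
    "untagged, high complexity": "ACGTACGTACGT",
}

for label, read in reads.items():
    print(label, "tag:", has_tag(read), "entropy:", passes_entropy(read))
```

Under these assumptions, the tagged low-complexity read is kept by the tag filter but discarded by the entropy filter, while the untagged high-complexity read passes the entropy filter and is only caught by the tag filter.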