Compression improves for larger sequencing runs and shorter read lengths.A) Each RNA-Seq dataset was trimmed to 50-base reads, and %unique reads was computed for a series of simulated sequencing run sizes (between 10 million single-end or paired-end reads and their original size). B) Each RNA-Seq dataset was randomly subset to 79 million single-end or paired-end reads, and %unique reads was computed for a series of simulated read lengths by trimming from the end (between 20 bases and their original read size). 25 bases is a typical sequence length that advanced RNA-Seq pipelines such as TopHat may use for segmented alignment.
Veeneman et al. BMC Bioinformatics 2012 13:297 doi:10.1186/1471-2105-13-297