Compressed sequence object (cseq) diagrams. Numbers below the data fields indicate the 0-based bit index from the left end. (A) The sequence ACGTAA contains no N’s, so its encoding bit is 0, indicating 2 bits per base. With that encoding, 6 nucleotides occupy 12 bits, so two bytes are required and the size field is 2. The sequence field is populated by A = 00, C = 01, G = 10, T = 11, etc., with the right-most byte padded on the right by zeros. (B) Compression proceeds as before until an N nucleotide is encountered, at which point the compression starts over and sets the encoding bit to 1, indicating 3 bits per base. With that encoding, 6 nucleotides occupy 18 bits, so 3 bytes are required, and the size field is updated accordingly. The sequence field is populated by A = 000, C = 001, G = 010, T = 011, N = 100, etc., and again the right end is padded with 0s.
Veeneman et al. BMC Bioinformatics 2012 13:297 doi:10.1186/1471-2105-13-297
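The packing described in the caption can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the names `CSeq` and `compress` are hypothetical, only the five codes listed in the caption are handled, and the bit widths of the encoding and size fields are not modeled because the caption does not state them. The sketch also checks for N up front instead of restarting partway through, which gives the same result.

```python
from typing import NamedTuple

# Codes taken from the caption; any other symbols are not handled here.
CODES_2BIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
CODES_3BIT = {"A": 0b000, "C": 0b001, "G": 0b010, "T": 0b011, "N": 0b100}


class CSeq(NamedTuple):
    """Hypothetical stand-in for the cseq object in the figure."""
    encoding: int  # 0 = 2 bits per base, 1 = 3 bits per base
    size: int      # number of bytes in the sequence field
    data: bytes    # packed bases, zero-padded on the right


def compress(seq: str) -> CSeq:
    # Assumption: scan for N first rather than restarting mid-sequence.
    encoding = 1 if "N" in seq else 0
    codes, width = (CODES_3BIT, 3) if encoding else (CODES_2BIT, 2)

    # Append each base's code to the right of the bits packed so far.
    bits = 0
    for base in seq:
        bits = (bits << width) | codes[base]

    nbits = width * len(seq)
    size = (nbits + 7) // 8        # round up to whole bytes
    bits <<= size * 8 - nbits      # pad the last byte with zeros on the right
    return CSeq(encoding, size, bits.to_bytes(size, "big"))
```

For panel (A), `compress("ACGTAA")` gives encoding 0, size 2, and the bytes `1b 00` (bits `00011011 00000000`). As a made-up input for the 3-bit case (the caption does not give the sequence used in panel (B)), `compress("ACGTNA")` gives encoding 1, size 3, and the bytes `05 38 00`.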