The setup
Say I've got:
A series of numbers resulting from LZW compression of a bitmap:
256 1 258 258 0 261 261 259 260 262 0 264 1 266 267 258 2 273 2 262 259 274 275 270 278 259 262 281 265 276 264 270 268 288 264 257
An LZW-compressed, variable-length-encoded bytestream (including the LZW code size header and sub-block markers) which represents this same series of numbers:
00001000 00101001 00000000 00000011 00001000 00010100 00001000 10100000 01100000 11000001 10000001 00000100 00001101 00000010 01000000 00011000 01000000 11100001 01000010 10000001 00000010 00100010 00001010 00110000 00111000 01010000 11100010 01000100 10000111 00010110 00000111 00011010 11001100 10011000 10010000 00100010 01000010 10000111 00001100 01000001 00100010 00001100 00001000 00000000
And an
initial code width
of8
.
The problem
I'm trying to derive the initial series of numbers (the integer array) from the bytestream.
From what I've read, the procedure here is to take the initial code width
, scan right-to-left, reading initial code width + 1
bits at a time, to extract the integers from the bytestream. For example:
iteration #1: 1001011011100/001/ yield return 4
iteration #2: 1001011011/100/001 yield return 1
iteration #3: 1001011/011/100001 yield return 6
iteration #4: 1001/011/011100001 yield return 6
This procedure will not work for iteration #5, which will yield 1:
iteration #5: 1/001/011011100001 yield return 1 (expected 9)
The code width should have been increased by one.
The question
How am I supposed to know when to increase the code width when reading the variable-length-encoded bytestream? Do I have all of the required information necessary to decompress this bytestream? Am I conceptually missing something?
UPDATE:
After a long discussion with greybeard - I found out that I was reading the binary string incorrectly: 00000000 00000011 00
is to be interpreted as 256, 1
. The bytestream is not read as big-endian.
And very roughly speaking, if you are decoding a bytestream, you increase the number of bits read every time you read 2^N-1 codes, where N is the current code width.
Decompressing, you are supposed to build a dictionary in much the same way as the compressor. You know you need to increase the code width as soon as the compressor might use a code too wide for the current width.
As long as the dictionary is not full (the maximum code is not assigned), a new code is assigned for every (regular) code put out (not the Clear Code or End Of Information codes).
With the example in the presentation you linked,
8
is assigned when the second6
is "transmitted" - you need to switch to four bits before reading the next code.(This is where the example and your
series of numbers
differ - the link presents4, 1, 6, 6, 2, 9
.)