Motivation
Imagine a huffman compressed file that gets downloaded partially, like in P2P software, so we allocate disk space for the whole file first and then start downloading random file chunks. One of the huffman codes (but we don't know which) is an end code, so if this code is decoded, we stop. Assuming the file consists of several huffman compressed streams, we can try to decompress some of them before the download finished.
The way the disk space is preallocated is important now: Let's assume we have the start of a huffman stream, but it isn't complete, so we run into the preallocated disk space. Usually, this space is all 0, so we'd keep decoding symbols with huffman code 00..
. If this isn't our end code, we don't notice the "invalid" data and f.e. if there's 2 GB of preallocated space, we're doing much useless decoding.
So we'd like to preallocate the space in a way to make decoding stop as soon as possible.
The question
I'm looking for the shortest bitstring that acts as a "Huffman terminator", meaning if we decode this string, we'll decode every Huffman code at least once, so we'll definitely receive an end code. This should work for every combination of Huffman codes of length 1..n
bits.
Note: I know there are simple solutions to the hypothetical scenario above (using 00..
as an end code, using P2P segment data to detect chunks not downloaded yet), but this is just an example scenario to show a theoretical use of a "Huffman terminator" bitstring, I'm not interested in solving this scenario, but looking for algorithms/ways/ideas to generate/find bitstrings that act as "Huffman terminator" .
Example
Let's look at the possible Huffman code combinations for n = 2
: [0, 1]
, [00, 01, 1]
, [0, 10, 11]
, [00, 01, 10, 11]
. Now let's start with a bitstring that contains every possible bit sequence of length 1..n
(0
, 1
, 00
, 01
, 10
, 11
):
001011
Decoding with the different Huffman code combinations gives (huffman codes are assigned to symbols A..D
):
Combination Decoded symbols
[0, 1] AABABB
[00,01,1] ACBC
[0,10,11] AABC
[00,01,10,11] ACD
This is a good start and it already decodes every Huffman code for the first three ones, but if we decode it with [00, 01, 10, 11]
, we're missing symbol B
(Huffman code 01
). So let's just append this to our bitstring:
00101101
This is a valid "Huffman terminator" for n=2
, 8 bits in length. If we'd preallocate our disk space with this byte, we'd be sure to terminate all huffman codes that won't exceed 2 bits. We even know there won't be shorter terminator strings for n=2
because it's the minimal length for the combination [00, 01, 10, 11]
to decode each symbol once.
I also found a "Huffman terminator" for n=3
, 0001011001110100111010011100010101111101110
(43 bits), but I'm not 100% sure if it's correct and I don't know if it's the shortest one.
What I'm looking for
Algorithms/ideas to find or generate Huffman terminators for a given
n
. My attempt would be similar to the example: generating a start string and appending bits as needed to satisfy all different Huffman code combinations. But I'm sure there are better ways.Specific Huffman terminators,
n=8
andn=16
, although it might be computational expensive to generate them and they will perhaps be very long.Papers/links about this problem (or similar ones) if there are any.
Bonus
Bonus points for finding "Huffman terminators" that also work if we start at bit position 1..n
, so it even terminates if data was decoded before and we won't arrive and start a new huffman code at the first bit.
If I understand correctly, then any universal terminator for an up-to-n-bit Huffman code requires at least n * 2^n bits, since it could be the case that there are 2^n source symbols (including the unknown termination symbol) that each occur with equal probability, thus necessitating an n-bit code for each symbol. This also tells you that any such minimal-length universal terminator will be a permutation of these 2^n blocks of n bits.
So for example for n=16, no universal terminator can be shorter than 1048576 bits, or 128Kb. (And of course it may need to be much longer.)