Huffman algorithm inverse matching

638 views Asked by At

I was wondering if given a binary sequence we can check if it matches a string using the Huffman algorithm.

for example, if we a string "abdcc" and several binary sequences we can calculate which one is a possible representation of "abdcc" that used Huffman's algorithm

1

There are 1 answers

1
samgak On BEST ANSWER

Interesting puzzle. As mentioned by j_random_hacker in a comment, it's possible to do this using a backtracking search. There are a few constraints to valid Huffman encodings of the string that we can use to narrow the search down:

  • No two Huffman codes of length n and m can be identical in the first n or m bits (whichever is shorter). This is because otherwise a Huffman decoder wouldn't be able to tell if it had encountered the longer or the shorter code when decoding. And obviously two codes of the same length cannot be identical. (1)
  • If at any time there are less bits remaining in the bitstream than characters remaining in the string we are matching then the string cannot match. (2)
  • If we reach the end of the string and there are still bits remaining in the bitstream then the string does not match (3)
  • If we encounter a character in the string for the second time, and we have already assumed a Huffman code for that same character earlier in the string, then an identical code must be present in the bit stream or the string cannot match. (4)

We can define a function matchHuffmanString that matches a string with Huffman encoded bitstream, with a Huffman code table as part of the global state. To begin with the code table is empty and we call matchHuffmanString, passing the start of the string and the start of the bitstream.

When the function is called, it checks if there are enough bits in the stream to match the string and returns if not. (2)

If the string is empty, then if the bitstream is also empty then there is a match and the code table is output. If the stream is empty but the bitstream is not then there is no match so the function returns. (3)

If characters remain in the string, then the first character is read. The function checks if there is already an entry in the code table for that character, and if so then the same code must be present in the bitstream. If not then there is no match so the function returns (4). If there is then the function calls itself, moving on to the next character and past the matching code in the bitstream.

If there is no matching code for the character, then the possibility that it is represented by a code of every possible length n from 1 bit to 32 bits (an arbitrary limit) is considered. n bits are read from the bitstream and checked to see if such a code would conflict with any existing codes according to rule (1). If no conflict exists then the code is added to the code table, then the function recurses, moving onto the next character and past the assumed code of length n bits. After returning then it backtracks by removing the code from the table.

Simple implementation in C:

#include <stdio.h>

// Huffman table:

// a 01
// b 0001
// c 1
// d 0010

char* string = "abdcc";

// 01 0001 0010 1 1

// reverse bit order (MSB first) an add extra 0 for padding to stop getBits reading past the end of the array:
#define MESSAGE_LENGTH  (12)
unsigned int message[] = {0b110100100010, 0};

// can handle messages of >32 bits, even though the above message is only 12 bits long
unsigned int getBits(int start, int n)
{
    return ((message[start>>5] >> (start&31)) | (message[(start>>5)+1] << (32-(start&31)))) & ((1<<n)-1);
}

unsigned int codes[26];
int code_lengths[26];
int callCount = 0;

void outputCodes()
{
    // output the codes:
    int i, j;
    for(i = 0; i < 26; i++)
    {
        if(code_lengths[i] != 0)
        {
            printf("%c ", i + 'a');
            for(j = 0; j < code_lengths[i]; j++)
                printf("%s", codes[i] & (1 << j) ? "1" : "0");
            printf("\n");
        }
    }
}

void matchHuffmanString(char* s, int len, int startbit)
{
    callCount++;

    if(len > MESSAGE_LENGTH - startbit)
        return;  // not enough bits left to encode the rest of the message even at 1 bit per char (2)

    if(len == 0) // no more characters to match
    {
        if(startbit == MESSAGE_LENGTH)
        {
            // (3) we exactly used up all the bits, this stream matches.
            printf("match!\n\n");
            outputCodes();
            printf("\nCall count: %d\n", callCount);
        }
        return;
    }

    // read a character from the string (assume 'a' to 'z'):
    int c = s[0] - 'a';

    // is there already a code for this character?
    if(code_lengths[c] != 0)
    {
        // check if the code in the bit stream matches:
        int length = code_lengths[c];
        if(startbit + length > MESSAGE_LENGTH)
            return; // ran out of bits in stream, no match
        unsigned int bits = getBits(startbit, length);
        if(bits != codes[c])
            return; // bits don't match (4)

        matchHuffmanString(s + 1, len - 1, startbit + length);
    }
    else
    {
        // this character doesn't have a code yet, consider every possible length
        int i, j;
        for(i = 1; i < 32; i++)
        {
            // are there enough bits left for a code this long?
            if(startbit + i > MESSAGE_LENGTH)
                continue;

            unsigned int bits = getBits(startbit, i);
            // does this code conflict with an existing code?
            for(j = 0; j < 26; j++)
            {
                if(code_lengths[j] != 0) // check existing codes only
                {
                    // do the two codes match in the first i or code_lengths[j] bits, whichever is shorter?
                    int length = code_lengths[j] < i ? code_lengths[j] : i;
                    if((bits & ((1 << length)-1)) == (codes[j] & ((1 << length)-1)))
                        break; // there's a conflict (1)
                }
            }
            if(j != 26)
                continue; // there was a conflict

            // add the new code to the codes array and recurse:
            codes[c] = bits; code_lengths[c] = i;
            matchHuffmanString(s + 1, len - 1, startbit + i);
            code_lengths[c] = 0; // clear the code (backtracking)
        }
    }
}

int main(void) {
    int i;
    for(i = 0; i < 26; i++)
        code_lengths[i] = 0;

    matchHuffmanString(string, 5, 0);

    return 0;
}

output:

match!

a 01
b 0001
c 1
d 0010

Call count: 42

Ideone.com Demo

The above code could be improved by iterating over the string as long as it is encountering characters that it already has a code for, and only recursing when it finds one it doesn't. Also it only works for lowercase letters a-z with no spaces and doesn't do any validation. I'd have to test it to be sure, but I think it's a tractable problem even for long strings, because any possible combinatorial explosion only happens when encountering new characters that don't already have codes in the table, and even then it's subject to contraints.