How exactly does file compression work at a low level using Huffman coding? (in C)

237 views Asked by At

TL;DR: How does the compression of plaintext using a Huffman code actually work?

I'm currently learning the Huffman coding algorithm and its application to text file compression. I understand that we could store the same data with less size by using an encoding technique (e.g. Huffman coding) which is determined by the frequency distribution of each character in the text file.

In Huffman coding we want the most frequent character in a text file to get the shortest binary representation (variable-length encoding), hence in total the amount of storage needed for the file is fewer than those of fixed-length encoding such as ASCII.

However I still have no idea on how to actually implement the compression. What kind of file should i use to store the Huffman-encoded binary representation of the text file?. How does the process of compressing the plaintext (probably in .txt format) into a compressed file actually work? Does decompression also work the same way as compression, just in the the reverse direction?

I've tried using binary file in C to store the binary representation of a .txt file. As you can expect the binary file actually became bigger than the original file.

I've read that converting a plain-text file into a compressed file is just a matter of replacing each letter with an appropriate bit string and then handling the possibility of having some extra bits that need to be written. However I still haven't found any good reference on what is a bit string and how to work with it.

Any reference would be helpful, and any answer with C implementation would be perfect. Thank you.

1

There are 1 answers

0
Mark Adler On

There is only one kind of file. A sequence of bytes. Each byte has eight bits. For Huffman coding you consider the file to be a sequence of bits as opposed to bytes. You accumulate the bits in a buffer, and when you have bytes you write them out to the file. Something like:

// Write the low bits of code to stdout. The remaining bits of code must be zero.
void put_bits(int bits, unsigned code) {
    static int have = 0;
    static unsigned buf = 0;
    if (bits == -1) {
        // flush remaining bits
        if (have) {
            putchar(buf);
            have = 0;
            buf = 0;
        }
        return;
    }
    buf |= code << have;
    have += bits;
    while (have >= 8) {
        putchar(buf);
        buf >>= 8;
        have -= 8;
    }
}