Optimal way to compress 60 bit string

I have 15 random hexadecimal digits (60 bits), and every 20-bit run (5 hexadecimal digits) is guaranteed to contain at least one duplicate digit.

What is the optimal way to compress the bytes?

Here are some examples:

01230 45647 789AA
D8D9F 8AAAF 21052
20D22 8CC56 AA53A
AECAB 3BB95 E1E6D
9993F C9F29 B3130

Initially I tried Huffman coding on just 20 bits, because Huffman coding can take 20 bits down to ~10 bits. The problem is that storing the table takes more than 9 bits.

Here is the breakdown showing 20 bits -> 10 bits for 01230:

Character   Frequency   Assignment  Space Savings
---------------------------------------------------------
0           2           0           2×4 - 2×1 = 6 bits
2           1           10          1×4 - 1×2 = 2 bits
1           1           110         1×4 - 1×3 = 1 bit
3           1           111         1×4 - 1×3 = 1 bit

I then tried Huffman coding on all 300 bits (five 60-bit runs). Here is the mapping for the examples above:

Character   Frequency   Assignment  Space Savings
---------------------------------------------------------
a           10          101         10×4 - 10×3 = 10 bits
9           8           000         8×4 - 8×3 = 8 bits
2           7           1111        7×4 - 7×4 = 0 bits
3           6           1101        6×4 - 6×4 = 0 bits
0           5           1100        5×4 - 5×4 = 0 bits
5           5           1001        5×4 - 5×4 = 0 bits
1           4           0010        4×4 - 4×4 = 0 bits
8           4           0111        4×4 - 4×4 = 0 bits
d           4           0101        4×4 - 4×4 = 0 bits
f           4           0110        4×4 - 4×4 = 0 bits
c           4           1000        4×4 - 4×4 = 0 bits
b           4           0011        4×4 - 4×4 = 0 bits
6           3           11100       3×4 - 3×5 = -3 bits
e           3           11101       3×4 - 3×5 = -3 bits
4           2           01000       2×4 - 2×5 = -2 bits
7           2           01001       2×4 - 2×5 = -2 bits

This yields a saving of 8 bits overall, but 8 bits isn't enough to store the Huffman table. Because the data is random, it seems that the more bits you try to encode with Huffman, the less effective it becomes. Huffman coding seemed to work best with 20 bits (50% reduction), but storing the table in 9 or fewer bits isn't possible AFAIK.
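The two payload sizes above can be double-checked with a short script. This is only a sketch: the function name `huffmanPayloadBits` is made up for illustration, and it relies on the standard property that the length of a Huffman-coded message equals the sum of all merged node weights. It counts payload bits only and ignores the cost of the table.

```javascript
// Total length in bits of a Huffman-coded message, excluding the code table.
// The encoded length equals the sum of the weights of all merged nodes.
function huffmanPayloadBits(text) {
  // Count how often each symbol occurs.
  const counts = new Map();
  for (const ch of text) counts.set(ch, (counts.get(ch) || 0) + 1);

  const weights = [...counts.values()];
  if (weights.length === 1) return text.length; // single symbol: 1 bit each

  let bits = 0;
  while (weights.length > 1) {
    // Repeatedly merge the two smallest weights.
    weights.sort((a, b) => a - b);
    const merged = weights.shift() + weights.shift();
    bits += merged;
    weights.push(merged);
  }
  return bits;
}

// The five example lines from the question, spaces removed (75 digits).
const sample =
  "0123045647789AA" +
  "D8D9F8AAAF21052" +
  "20D228CC56AA53A" +
  "AECAB3BB95E1E6D" +
  "9993FC9F29B3130";

console.log(huffmanPayloadBits("01230")); // 10 (raw: 20)
console.log(huffmanPayloadBits(sample));  // 292 (raw: 300)
```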


In the worst case, a 60-bit string still has at least 3 duplicates (one per 20-bit run), and I assume the average case has more than 3. With at least 3 duplicates among 15 digits, a 60-bit run can contain at most 12 distinct symbols.
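As a quick sanity check on that bound, the distinct digits in each example line can be counted. This is a sketch only, and the helper name `distinctDigits` is made up for illustration.

```javascript
// Count the distinct hex digits in one line (spaces are ignored).
function distinctDigits(line) {
  return new Set(line.replace(/\s+/g, "")).size;
}

// The five example lines from the question.
const lines = [
  "01230 45647 789AA",
  "D8D9F 8AAAF 21052",
  "20D22 8CC56 AA53A",
  "AECAB 3BB95 E1E6D",
  "9993F C9F29 B3130",
];

for (const line of lines) {
  const distinct = distinctDigits(line);
  // 15 digits per line, so (15 - distinct) digits repeat an earlier one.
  console.log(line, "distinct:", distinct, "repeats:", 15 - distinct);
}
```

For these five samples every line stays below the 12-symbol limit, which fits the assumption that the average case has more than 3 duplicates, though five lines are far too few to prove it.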

Because of the duplicates and the fewer than 16 symbols, I can't help but feel that some type of compression should be possible.

There are 3 answers

1
Itchy On BEST ANSWER

If I split your question in two parts:

  1. How do I compress (perfectly) random data? You can't. Every bit is new entropy which can't be "guessed" by a compression algorithm.
  2. How do I compress "one duplicate in five characters"? There are exactly 10 options for where the duplicated pair can be (see table below). That position is basically the only redundancy you can exploit. Just store which option it is (maybe grouped for the whole line).

These are the options:

AAbcd = 1    AbAcd = 2    AbcAd = 3    AbcdA = 4    (<-- cases where first character is duplicated somewhere)
             aBBcd = 5    aBcBd = 6    aBcdB = 7    (<-- cases where second character is duplicated somewhere)
                          abCCd = 8    abCdC = 9    (<-- cases where third character is duplicated somewhere)
                                       abcDD = 0    (<-- cases where last characters are duplicated)

So for your first example:

01230 45647 789AA

The first one (01230) is option 4, the second (45647) is option 3, and the third (789AA) is option 0.
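The lookup can be sketched in a few lines. The names `PAIRS` and `splitGroup` are made up for illustration, and one assumption goes beyond the answer: when a group contains more than one duplicate (for example 8AAAF), the first matching pair in table order is used.

```javascript
// [first position, second position, option number], in the order of the table.
const PAIRS = [
  [0, 1, 1], [0, 2, 2], [0, 3, 3], [0, 4, 4],
  [1, 2, 5], [1, 3, 6], [1, 4, 7],
  [2, 3, 8], [2, 4, 9],
  [3, 4, 0],
];

// Returns the option number plus the four digits that remain after dropping
// the second copy of the duplicated digit.
// Assumption: the first matching pair in table order wins.
function splitGroup(group) {
  for (const [i, j, option] of PAIRS) {
    if (group[i] === group[j]) {
      return { option, rest: group.slice(0, j) + group.slice(j + 1) };
    }
  }
  throw new Error("group has no duplicate digit: " + group);
}

console.log("01230 45647 789AA".split(" ").map(splitGroup));
// [ { option: 4, rest: '0123' },
//   { option: 3, rest: '4567' },
//   { option: 0, rest: '789A' } ]
```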

You can combine the three option numbers into one value by multiplying the running total by 10 before adding the next option: (4*10 + 3)*10 + 0 = 430. To uncompress, use integer division and modulo: 430%10 = 0, (430/10)%10 = 3, (430/10/10)%10 = 4. So you could store your line like this:

1AE 0123 4567 789A
^^^ this is 430 in hex and requires only 10 bits

The three options combined give 1000 possible values (0 to 999), which is less than 2^10 = 1024, so 10 bits are enough.
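The packing arithmetic is easy to try out. This is a minimal sketch; `packOptions` and `unpackOptions` are hypothetical names, not part of any library.

```javascript
// Pack three option numbers (each 0..9) into one value in the range 0..999.
function packOptions([a, b, c]) {
  return (a * 10 + b) * 10 + c;
}

// Reverse of packOptions, using integer division and modulo.
function unpackOptions(value) {
  return [
    Math.floor(value / 100) % 10,
    Math.floor(value / 10) % 10,
    value % 10,
  ];
}

const header = packOptions([4, 3, 0]);
console.log(header);                            // 430
console.log(header.toString(16).toUpperCase()); // 1AE
console.log(unpackOptions(header));             // [ 4, 3, 0 ]

// 1000 possible values need ceil(log2(1000)) bits.
console.log(Math.ceil(Math.log2(1000)));        // 10
```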

Compared to storing these 3 characters normally (12 bits), you save 2 bits. As someone else already commented, this is probably not worth it. For the whole line the relative saving is even smaller: 2 bits / 60 bits = 3.3%.

0
Mark Adler On

If I simply count the number of 20-bit values with at least two hexadecimal digits equal, there are 524,416 of them. A smidge more than 2^19. So the most you could possibly save is a little less than one bit out of the 20.

Hardly seems worth it.
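The count is small enough to reproduce by brute force. The sketch below walks all 2^20 values and uses a 16-bit mask to detect a repeated hex digit; the variable names are illustrative only.

```javascript
// Count the 20-bit values (5 hex digits) that contain a repeated digit.
let withDuplicate = 0;
for (let value = 0; value < (1 << 20); value++) {
  let seen = 0; // bitmask of the hex digits seen so far
  let v = value;
  for (let k = 0; k < 5; k++) {
    const bit = 1 << (v & 0xf);
    if (seen & bit) {
      withDuplicate++;
      break;
    }
    seen |= bit;
    v >>= 4;
  }
}

const bitsNeeded = Math.log2(withDuplicate);
console.log(withDuplicate);   // 524416
console.log(bitsNeeded);      // slightly above 19
console.log(20 - bitsNeeded); // slightly below 1 bit saved per 20-bit group
```

The same number follows from 16^5 - 16·15·14·13·12 = 1,048,576 - 524,160 = 524,416.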

0
Promaster On

If you want to get rid of the duplicates first, do this, then look at the links at the bottom of the page. If you don't want to get rid of the duplicates, then still look at the links at the bottom of the page:

// Returns true if the array contains the value v (strict equality).
Array.prototype.contains = function(v) {
  for (var i = 0; i < this.length; i++) {
    if (this[i] === v) return true;
  }
  return false;
};

// Returns a new array without duplicates, keeping the first occurrence of each value.
Array.prototype.unique = function() {
  var arr = [];
  for (var i = 0; i < this.length; i++) {
    if (!arr.contains(this[i])) {
      arr.push(this[i]);
    }
  }
  return arr;
};

var duplicates = [1, 3, 4, 2, 1, 2, 3, 8];
var uniques = duplicates.unique(); // result = [1,3,4,2,8]

console.log(uniques);

That shortens the data you have to deal with. After that, you might want to check out Smaz:

Smaz is a simple compression library suitable for compressing strings.

If that doesn't work, then you could take a look at this:

http://ed-von-schleck.github.io/shoco/

Shoco is a C library to compress and decompress short strings. It is very fast and easy to use. The default compression model is optimized for English words, but you can generate your own compression model based on your specific input data.

Let me know if it works!