How to compress a strictly increasing sequence of Longs

53 views Asked by At

What is the best way to compress a (sorted) monotically increasing sequence of longs?

Characteristics about the sequence:

  • all the values are unique (no repeated)
  • length of sequence can range from very small (3 numbers) to extremely large (100k numbers)
  • the numbers are often close in proximity (almost consecutive), but then suddenly make a big jump e.g. 1, 2, 3, 1000, 1001, 1002, 15001, 15002
  • the sequence is not finite, there should be an insertion operation for adding new numbers to the sequence which should also get compressed

Goal:

  • Compress the sequence as small as possible

What have I tried:

  • Delta Encoding seemed like a good idea for the numbers close in proximity. However, I couldn't figure out how to pack the bits for these delta values.
  • Calculating the deltas and then Huffman Encoding the delta array. I didn't think this would provide the optimal compression, chars in Java take 2 bytes and I would also have to store the huffman tree
  • SIMD operations: tried reading lemire's article about SIMD compression, but found it difficult to understand and implement in Java
  • Run length encoding wouldn't work because all numbers are unique
1

There are 1 answers

0
Mark Adler On

Delta coding is a good start. Then use variable-length integers to code the deltas. Since the deltas are all positive, subtract one. Code each integer as a sequence of bytes with the high bit set on all but the last bytes. You get seven bits of the integer per byte. So 0..127 (or a delta of 1..128) is coded in one byte. 128..16383 is coded in two bytes. And so on. Then use a standard lossless compressor on the output of that.