What is the best way to compress a (sorted) monotically increasing sequence of longs?
Characteristics about the sequence:
- all the values are unique (no repeated)
- length of sequence can range from very small (3 numbers) to extremely large (100k numbers)
- the numbers are often close in proximity (almost consecutive), but then suddenly make a big jump e.g. 1, 2, 3, 1000, 1001, 1002, 15001, 15002
- the sequence is not finite, there should be an insertion operation for adding new numbers to the sequence which should also get compressed
Goal:
- Compress the sequence as small as possible
What have I tried:
- Delta Encoding seemed like a good idea for the numbers close in proximity. However, I couldn't figure out how to pack the bits for these delta values.
- Calculating the deltas and then Huffman Encoding the delta array. I didn't think this would provide the optimal compression, chars in Java take 2 bytes and I would also have to store the huffman tree
- SIMD operations: tried reading lemire's article about SIMD compression, but found it difficult to understand and implement in Java
- Run length encoding wouldn't work because all numbers are unique
Delta coding is a good start. Then use variable-length integers to code the deltas. Since the deltas are all positive, subtract one. Code each integer as a sequence of bytes with the high bit set on all but the last bytes. You get seven bits of the integer per byte. So 0..127 (or a delta of 1..128) is coded in one byte. 128..16383 is coded in two bytes. And so on. Then use a standard lossless compressor on the output of that.