I made my own implementation of HyperLogLog algorithm
. It works well, but sometimes I have to fetch a lot (around 10k-100k) of HLL structures and merge them.
I store each of them as a bit string so first I have to convert each bit string to buckets. Since there is a LOT of HLL's it takes more time than I would like it to.
Currently around 80% of runtime takes this line of code called once for each HLL:
my @buckets = map { oct '0b'.$_ } unpack('(a5)1024', $bitstring);
Is there any way to do it faster?
If we leave definition of HyperLogLog behind, task could be explained like this: given that $bitstring
consists from 1024 5-bit counters (so each counter could have value up to 32) we have to convert it to array of 1024 integers.
The
a
denotes arbitrary, zero-padded binary data. Here, you treat that data as ASCII text but it can only contain1
and0
! This is inefficient asa5
ends up using five bytes. The easiest and most efficient solution would be to store a 8-bit number for each counter, then:my @buckets = unpack 'C1024', $bitstring
.If you only want to store five bits per counter, you end up saving very little memory for very much hassle. You'd have to use something insane like this for a round-trip conversion: