Speeding up my implementation of HyperLogLog algorithm

345 views Asked by At

I made my own implementation of HyperLogLog algorithm. It works well, but sometimes I have to fetch a lot (around 10k-100k) of HLL structures and merge them.

I store each of them as a bit string so first I have to convert each bit string to buckets. Since there is a LOT of HLL's it takes more time than I would like it to.

Currently around 80% of runtime takes this line of code called once for each HLL:

my @buckets = map { oct '0b'.$_ } unpack('(a5)1024', $bitstring);

Is there any way to do it faster?

If we leave definition of HyperLogLog behind, task could be explained like this: given that $bitstring consists from 1024 5-bit counters (so each counter could have value up to 32) we have to convert it to array of 1024 integers.

1

There are 1 answers

2
amon On BEST ANSWER

The a denotes arbitrary, zero-padded binary data. Here, you treat that data as ASCII text but it can only contain 1 and 0! This is inefficient as a5 ends up using five bytes. The easiest and most efficient solution would be to store a 8-bit number for each counter, then: my @buckets = unpack 'C1024', $bitstring.

If you only want to store five bits per counter, you end up saving very little memory for very much hassle. You'd have to use something insane like this for a round-trip conversion:

my $bitstring = pack "(b5)1024", map { sprintf "%b", $_ } @buckets;
@buckets = map { oct "0b$_" } unpack "(b5)1024", $bitstring;