Which hash function does HyperLogLog use?

242 views Asked by At

I have read in a few articles that HyperLogLog and LogLog use a hash function and that it is solely responsible for the prediction value. If we assign a value to a certain username to predict the number of times the individual has visited a page, and that value is constant for the name, the algorithm will contain bias.

For example, if Bob is hashed as 1000101 and its value is always going to be this, how will the algorithm predict the number of times a unique user has visited a page? All Bobs are always going to have a constant value 1 even if a new Bob visits the page. Could anyone explain how LogLog is able to predict the value of the unique visitors.

I know it uses a 64 bit hash and the above example is just for understanding.

1

There are 1 answers

2
templatetypedef On BEST ANSWER

There are two separate processes involved here, and I think your confusion is coming from mixing the two of them up.

You can think of a HyperLogLog estimator as a black box that has two operations:

  • see(x), which records that x has been seen, and
  • estimate(), which returns an estimate of how many distinct values have been seen.

In your example, you're discussing the possibility that multiple distinct people named Bob visit a website, and website then calls see(Bob) multiple times. In that case, you're right - HyperLogLog has no idea that these are different people named Bob. But that has nothing to do with the internal hash functions being used - that's purely a consequence of the fact that HyperLogLog only promises to estimate the number of distinct things it's seen, and if you treat each person named Bob as just "Bob," HyperLogLog has no way of knowing they aren't different people.

(As an analogy, imagine you asked me to count how many different people were in the list [Bob, Bob, Bob, Bob, Bob]. I'd assume the answer is "one," but if you then told me "actually, each Bob is a different person," my response would be "ah, I see, but the way you asked me to solve the problem wasn't fair because you didn't give me any information to differentiate the Bobs from one another.")

In practice, though, if you wanted to count distinct users to a website, you wouldn't just use a first name as a way of identifying a person. You might, for example, use a person's IP address, treating different IP addresses as different visitors. Or you might require a login of some sort, then use a person's internal ID number.

As for which hash function HyperLogLog uses, that's separate from the above discussion. The original paper on HyperLogLog doesn't say which hash function to use and instead just assumes it's given one that looks "random enough." In practice, you'd go with whichever hash function you thought looked "random enough," which could be something like the Jenkins hash, or SHA-256, etc. depending on context.