Computing the approximate population of a bloom filter

1.3k views Asked by At

Given a bloom filter of size N-bits and K hash functions, of which M-bits (where M <= N) of the filter are set.

Is it possible to approximate the number of elements inserted into the bloom filter?

Simple Example

I've been mulling over the following example, assuming a BF of 100-bits and 5 hash functions where 10-bits are set...

Best case scenario: Assuming the hash functions are really perfect and uniquely map a bit for some X number of values, then given 10-bits have been set we can say that there has only been 2 elements inserted into the BF

Worst case scenario: Assuming the hash functions are bad and consistently map to the same bit (yet unique amongst each other), then we can say 10 elements have been inserted into the BF

The range seems to be [2,10] where abouts in this range is probably determined by the false-positive probability of filter - I'm stuck at this point.

2

There are 2 answers

0
swen On

This question worries me a bit because there are better algorithms for approximately counting the number of distinct elements with small amounts of storage.

Nevertheless, if we must use a Bloom filter, let's assume that the hash functions are random oracles (all values chosen independently, or "really perfect", not to be confused with perfect hashing). Now we have a balls and bins problem: given that M of N bins have balls in them, how many balls did we throw? Let B be the number of balls thrown; the number of items is B/K, since for each item we throw K balls.

The standard approximation for balls and bins processes is to model each bin as an independent Poisson process; the time before a bin becomes occupied is exponentially distributed. Letting 1 be the time it takes to throw all of the balls, the maximum likelihood estimate λ of the rate of this exponential distribution satisfies Pr(Exponential[λ] < 1) = M/N, so 1 - exp(-λ) = M/N and λ = -log(1 - M/N). The parameter λ is akin to the number of balls, so the estimate for the number of items is B ≈ -N log(1 - M/N)/K.

EDIT: There are N bins, so we need to multiply by N.

0
mcdowella On

The entry at Wikipedia gives you a formula for the probability of any particular bit being set, assuming that the hash functions make everything random. This is 1 - (1-1/m)^kn. Since there are m bits in the filter, this means that the expected/mean number of bits set will be m(1-(1-1/m)^kn). So you could come up with a reasonably plausible guess for n by choosing the n that makes this equal to the number of bits actually set.

To get and idea of how accurate such a guess is likely to be, it would be nice to have an idea of the variance of the number of bits set. You can work this out exactly, but it is something of a pain in the neck. You can use the fact that Var(X) = E(X^2) - E(X)^2. In this case, E(X^2) depends mostly on the probability that pairs of bits will both be set, and you can work this out by considering the probability of statements like "bit 0 is set and bit 1 is clear and all the other bits are clear" and "bit 0 is clear" and "all bits except for 0 and 1 are clear".