Given a bloom filter of size N-bits and K hash functions, of which M-bits (where M <= N) of the filter are set.
Is it possible to approximate the number of elements inserted into the bloom filter?
Simple Example
I've been mulling over the following example, assuming a BF of 100-bits and 5 hash functions where 10-bits are set...
Best case scenario: Assuming the hash functions are really perfect and uniquely map a bit for some X number of values, then given 10-bits have been set we can say that there has only been 2 elements inserted into the BF
Worst case scenario: Assuming the hash functions are bad and consistently map to the same bit (yet unique amongst each other), then we can say 10 elements have been inserted into the BF
The range seems to be [2,10] where abouts in this range is probably determined by the false-positive probability of filter - I'm stuck at this point.
This question worries me a bit because there are better algorithms for approximately counting the number of distinct elements with small amounts of storage.
Nevertheless, if we must use a Bloom filter, let's assume that the hash functions are random oracles (all values chosen independently, or "really perfect", not to be confused with perfect hashing). Now we have a balls and bins problem: given that
M
ofN
bins have balls in them, how many balls did we throw? LetB
be the number of balls thrown; the number of items isB/K
, since for each item we throwK
balls.The standard approximation for balls and bins processes is to model each bin as an independent Poisson process; the time before a bin becomes occupied is exponentially distributed. Letting
1
be the time it takes to throw all of the balls, the maximum likelihood estimateλ
of the rate of this exponential distribution satisfiesPr(Exponential[λ] < 1) = M/N
, so1 - exp(-λ) = M/N
andλ = -log(1 - M/N)
. The parameterλ
is akin to the number of balls, so the estimate for the number of items isB ≈ -N log(1 - M/N)/K
.EDIT: There are
N
bins, so we need to multiply byN
.