I have tried using bloom filter for performing membership tests. I wish to perform membership tests on 80 billion entries with only allowing around 100 collisions to happen i.e., only 100 entries can be given a false positive result.
I know this can be achieved by bloom filters but using the formulas of determining the number of bits required per entry and the number of hash functions given the allowed false positive rate. I figured I would end up using 270 GB of memory and 19 hash functions.
I also had a look at Cuckoo filter but its memory requirements don't match my requirements. My Requirements are as follows :
- Use at max 6 bits per element
- Use no more than 7-8 Hash Functions.
Could someone suggest me a probabilistic data structure other than the one's mentioned above that can help in achieving my requirements?
The issue with the number of hash functions is not really a problem - just pick a single hash function with many bits of output and divide up the bits as if they were from separate hash functions. Your real problem here is the false positive rate tradeoff with storage space.
You have said
Entries that are in the map can, by definition, not be false positives. They are true positives.
The question then is "100 false positives taken over how many entries that you intend to test?" If the answer is also, oddly enough, 80 billion, then you are asking for a false positive rate of around 100/80,000,000,000 = 1/800,000,000, which is less than 2^-29.
The minimum space for any approximate membership data structure like Bloom filters or cuckoo filters is n lg 1/ε bits, where n is the number of elements in the structure, lg is the logarithm base 2, and ε is the false positive rate. In other words, you need more than 29 bits per element to achieve a false positive rate like 100 per 80 billion. 6 bits per element will get you 1.56% false positive rate at best. That's 1.25 billion per 80 billion, or 100 per 6400.
As far as I know there are no known practical data structures that come close to achieving this. Bloom filters don't, for instance, because they use more than lg 1/ε bits per item. Cuckoo filters don't because they use at least two additional metadata bits per item and have a bits-per-item rate that scales with lg n.