Data structure for storing classification of billions of 64-bit integers


I have a set of ~10.2 billion 64-bit elements that are:

  1. Constant and known before compile time (so no insertion, deletion, or modification).
  2. Classified into 3 categories, also constant and known before compile time for all elements.
  3. They store permutations of [0...11] and an 11-bit tag that can be any number in [0, 2047]. Most permutations have an element for each of the 2048 possible tags. Specifically, (element >> (4 * pos)) & 0b1111 denotes where position pos is permuted to, and element >> 48 gives the 11-bit tag (see the sketch below).
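
For reference, this is how I read that layout in code (a minimal sketch of the encoding described above; the mask on the tag is redundant if the top bits are always zero):

```cpp
#include <cstdint>

// Where position `pos` (0..11) is permuted to, read from the low 48 bits.
inline uint32_t perm_at(uint64_t element, unsigned pos) {
    return static_cast<uint32_t>((element >> (4 * pos)) & 0xF);
}

// The 11-bit tag stored above the 48 permutation bits.
inline uint32_t tag_of(uint64_t element) {
    return static_cast<uint32_t>((element >> 48) & 0x7FF);
}
```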

Constraints that I want to satisfy:

  1. Memory consumption: within 4GB to fit in RAM, so at most ~3.1 bits per element on average.
  2. No false positives/negatives: Bloom filters are fine as a potential component, but some fallback must be in place to guarantee correctness.
  3. "Fast" look up of an element's category: Since everything here is constant, my understanding is that O(log n) would effectively be a small constant. The operation itself happens extremely often, so I think estimations of clock cycles would be a better measure than asymptotic runtime.
  4. (Not as important) Runs well on a GPU: from my understanding, this means arithmetic operations are much preferred over branching and global memory accesses.

Given the memory constraint and the partially structured, constant nature of the data, similar posts could not provide me with a complete solution.

Most conventional data structures would fail the memory requirement, since they usually store the elements themselves (and 64 bits per element is evidently far beyond ~3.1 bits). Bloom filters, on their own, are not ideal since one miscategorization could break the rest of the algorithm.

Minimal perfect hash functions could be a viable solution. Since most permutations have an element for each tag, a potential approach is to separate the 11-bit tag from the 48-bit permutation, use a minimal perfect hash on the permutation bits only, then shift the hash and add the 11-bit tag back. We then use that to index into a bitset where 2 bits denote the category of each element. That leaves ~1.1 bits per element of budget, but since the minimal perfect hash is only over the permutation bits (each permutation is shared by up to 2048 tags), we effectively have ~1.1 × 2048 ≈ 2252.8 bits of budget per permutation. However, I'm not sure which minimal perfect hash scheme would fit my use case best (the obvious choice if all 12! permutations were in play would be the factorial number system, but only ~1% of them actually appear in the data), and I feel like there might be a more direct way of figuring this out.
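
To make the indexing concrete, here is a rough sketch of the lookup I have in mind. perm_to_index is a placeholder for whatever minimal perfect hash ends up being used (e.g. something built with an MPHF library such as CMPH or BBHash), and the sizing comment assumes ~10.2 billion elements:

```cpp
#include <cstdint>
#include <vector>

// Placeholder: a minimal perfect hash over the set of 48-bit permutation
// keys that actually occur in the data, mapping each to a unique index in
// [0, num_perms). Supplied by whichever MPHF scheme gets chosen.
uint64_t perm_to_index(uint64_t perm_bits);

// Packed 2-bit category codes, one slot per (permutation, tag) pair.
// ~10.2e9 elements * 2 bits ≈ 2.55 GB; slots for pairs that never occur
// are wasted but harmless.
std::vector<uint64_t> category_bits;

inline uint32_t category_of(uint64_t element) {
    uint64_t perm = element & ((1ULL << 48) - 1);  // low 48 bits: permutation
    uint64_t tag  = (element >> 48) & 0x7FF;       // 11-bit tag
    uint64_t slot = perm_to_index(perm) * 2048 + tag;
    uint64_t word = category_bits[slot >> 5];      // 32 two-bit slots per 64-bit word
    return static_cast<uint32_t>((word >> ((slot & 31) * 2)) & 0x3);
}
```

The multiply and shifts are plain arithmetic, so I'm hoping the cost is dominated by the MPHF evaluation plus one memory access, which should also keep it reasonably GPU-friendly.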

Since there are 3 categories, I was thinking of making some sort of set for the first 2 categories and querying for membership (checking the 3rd category isn't necessary, since if an element is not in the first 2 then it's in the 3rd). I'm completely unsure what would be suitable for this, since a perfect hash function seems to contain too much information if just membership testing is needed. I've also looked into tries and related structures, but I'm not sure whether they would apply to the problem; from what I've read they would consume too much memory. Any guidance on this topic is greatly appreciated!
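
Independent of which structure ends up backing the two sets, the query I have in mind would combine the two membership tests branchlessly, roughly like this (both membership functions below are placeholders):

```cpp
#include <cstdint>

// Placeholders: exact (no false positives/negatives) membership tests for
// the first two categories, backed by whatever structure turns out to fit.
bool in_category_0(uint64_t element);
bool in_category_1(uint64_t element);

// The categories are disjoint, so two bits decide all three cases without branching.
inline uint32_t category_via_membership(uint64_t element) {
    uint32_t c0 = in_category_0(element) ? 1u : 0u;
    uint32_t c1 = in_category_1(element) ? 1u : 0u;
    return 2u - 2u * c0 - c1;  // 0 if in the 1st set, 1 if in the 2nd, 2 otherwise
}
```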

P.S. Everything is going to be implemented in C/C++, so resources in those languages are strongly preferred but not necessary.
