On-disk structure for storing a large set of 128-bit integers?

I have about 500 million 128-bit integers, and I add about 100M more per year. Nothing is ever deleted. The numbers are uniformly distributed, both in value and over time.

Basically, all I need is an add operation that also returns whether the number already exists in the DB. Also, I do not want to use too much RAM for this system, so just storing everything in memory is not what I'm looking for.

So far we've been using several MyISAM tables in MySQL, with two bigints as the primary key. This gives us OK performance, but I suspect it is not the right tool for the job. We had performance problems before splitting the tables, and we've had corruption after power outages. Also, a DB gives us many features that we do not need.
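
For context, each value is split into two 64-bit halves to fit the two bigint columns; roughly along these lines (a simplified sketch, not our exact code, and the column layout is only illustrative):

```python
def split_u128(value):
    """Split an unsigned 128-bit integer into (high, low) 64-bit halves."""
    assert 0 <= value < 1 << 128
    return value >> 64, value & 0xFFFFFFFFFFFFFFFF

def join_u128(hi, lo):
    """Reassemble the original 128-bit integer from its two halves."""
    return (hi << 64) | lo
```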

I'm using Python on Linux, but I'm open to suggestions.

Similar question in C++.

UPDATE: Marcelo's comment mentioned Bloom filters, which seem really promising to me. Since I'm working with hashes, I've already given up on complete accuracy, so this might be a great precision/performance trade-off.
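
To make the trade-off concrete, here is a minimal hand-rolled sketch (not a tuned library; it uses the standard sizing formulas, and at 500M entries with a 0.1% false-positive rate the bit array alone would be roughly 900 MB, so I might need a partitioned or disk-backed variant):

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter sketch: no false negatives, tunable false-positive rate."""

    def __init__(self, capacity, error_rate=0.001):
        # Standard sizing: m bits and k hash functions for the target error rate.
        self.m = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, value):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(value.to_bytes(16, "big")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, value):
        """Set the bits for value; return True if it was possibly present already."""
        seen = True
        for pos in self._positions(value):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] >> bit & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen
```

Here add() returns True when the value was probably seen before and False when it definitely was not, which matches the add-and-check operation I need.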

There are 2 answers

Marcelo Cantos (best answer)

Insert each integer into one of a pool of 2^n SQLite databases (2^8 is probably a good number), chosen by computing an n-bit hash of the integer. Make the one column of the one table a primary key, so that attempting to insert an existing number fails.

Assuming the integers are already random enough, you can probably just pick, say, the least significant byte as the "hash".
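
Something along these lines (a sketch, not production code; the file and table names are made up, and the value is stored as a 16-byte blob because SQLite integers are only 64-bit):

```python
import sqlite3

N_BUCKETS = 256  # 2^8 databases, picked by the low byte of the value

conns = []
for i in range(N_BUCKETS):
    conn = sqlite3.connect(f"bucket_{i:03d}.db")
    conn.execute("CREATE TABLE IF NOT EXISTS nums (val BLOB PRIMARY KEY)")
    conns.append(conn)

def add(value):
    """Insert a 128-bit int; return True if it was already present."""
    conn = conns[value & 0xFF]            # least significant byte picks the bucket
    blob = value.to_bytes(16, "big")      # full 128 bits as a 16-byte blob
    try:
        with conn:                        # one transaction per insert
            conn.execute("INSERT INTO nums (val) VALUES (?)", (blob,))
        return False                      # newly inserted
    except sqlite3.IntegrityError:
        return True                       # primary-key clash: already existed
```

Committing per insert like this is the slow way; batching many inserts per transaction, as in the test below, is what makes the throughput reasonable.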

EDIT: I've done some testing. I inserted 25 million entries in about two hours, but it gobbled up over 1 GB in the process. The test generates random numbers and distributes them to 32 subprocesses, each with one SQLite database under its control, committing once every 100,000 inserts. Insertions are currently chugging along at about 1350 Hz, far beyond the 3 Hz your problem demands, but the entire database still fits in cache (I have 8 GB of RAM). I won't know the steady-state performance until I get close to your current database size. At that point, every insertion will induce at least four disk-head movements (reading and writing the index and the table, plus probably more to drill down into the B+ tree), and then you'll know how much pain you're really in for.
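
Roughly, each subprocess ran a loop along these lines (simplified; the queue-based hand-off, sentinel, and file names are illustrative, and the driver below uses 4 workers instead of 32):

```python
import random
import sqlite3
from multiprocessing import Process, Queue

COMMIT_EVERY = 100_000  # batch size used in the test above

def worker(db_path, queue):
    """One subprocess owns one SQLite file and commits in large batches."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS nums (val BLOB PRIMARY KEY)")
    pending = 0
    while True:
        value = queue.get()
        if value is None:                 # sentinel: flush and stop
            break
        try:
            conn.execute("INSERT INTO nums (val) VALUES (?)",
                         (value.to_bytes(16, "big"),))
        except sqlite3.IntegrityError:
            pass                          # duplicate: already present
        pending += 1
        if pending >= COMMIT_EVERY:
            conn.commit()
            pending = 0
    conn.commit()
    conn.close()

if __name__ == "__main__":
    # Tiny driver: 4 workers and far fewer numbers than the real test.
    queues = [Queue() for _ in range(4)]
    procs = [Process(target=worker, args=(f"bench_{i}.db", q))
             for i, q in enumerate(queues)]
    for p in procs:
        p.start()
    for _ in range(1_000_000):
        value = random.getrandbits(128)
        queues[value & 3].put(value)      # route by the low two bits
    for q in queues:
        q.put(None)
    for p in procs:
        p.join()
```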

I'm starting to think that this is a genuinely interesting problem that could be solved much more efficiently with a custom-built solution. However, I suspect it will take a reasonable amount of effort to significantly outperform a database engine.

High Performance Mark

Hash your hashes?