On-disk structure for storing a large set of 128-bit integers?


I have about 500 million 128-bit integers, adding about 100M per year. Nothing is ever deleted. The numbers are uniformly distributed, both in value and over time.

Basically, all I need is an add operation that also returns whether the number already exists in the DB. Also, I do not want to use too much RAM for this system, so just storing everything in memory is not what I'm looking for.

So far we've been using several MyISAM tables on MySQL, using two bigints as a primary key. This gives us OK performance, but I suspect it is not the right tool for this job. We had some performance problems before splitting the tables, and we've had corruption on power outages. Also, a DB gives us many more features than we need.

I'm using Python on Linux, but I'm open to suggestions.

Similar question in C++.

UPDATE: Marcelo's comment mentioned Bloom filters, which seem really promising to me. Since I'm working with hashes, I've already given up on complete accuracy, so this might be a great precision/performance trade-off.
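For illustration, here is a minimal Bloom filter sketch in Python. The class, the sizing formulas, and the trick of using slices of the 128-bit value itself as the k "hash functions" are assumptions for this sketch, not something prescribed in the thread:

```python
import math

class BloomFilter:
    """Minimal Bloom filter for 128-bit integers that are already hash-like.

    Because the inputs are uniformly distributed, the k hash functions can
    simply be k mixed 32-bit slices of the value itself.
    """

    def __init__(self, capacity, error_rate=0.001):
        # Standard sizing: m bits and k hash functions for the target error rate.
        self.m = int(-capacity * math.log(error_rate) / (math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, value):
        # Derive k bit positions from 32-bit slices of the 128-bit value,
        # mixing in the slice index so repeated slices still differ.
        for i in range(self.k):
            chunk = (value >> (32 * (i % 4))) & 0xFFFFFFFF
            yield (chunk ^ (i * 0x9E3779B9)) % self.m

    def add(self, value):
        """Insert value; return True if it was (possibly) already present."""
        seen = True
        for pos in self._positions(value):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen
```

Note the memory cost: at 500 million keys and a 0.1% false-positive rate, the standard formulas give roughly 7.2 billion bits, i.e. around 860 MB of RAM, so the "not too much RAM" constraint still matters when choosing the error rate.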


2 Answers

Marcelo Cantos (BEST ANSWER)

Insert each integer into one of a pool of 2^n SQLite databases (2^8 is probably a good number), chosen by computing an n-bit hash of the integer. Make the one column of the one table a primary key, so that attempting to insert an existing number fails.

Assuming the integers are already random enough, you can probably just pick, say, the least significant byte as the "hash".
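A hedged Python sketch of this scheme, assuming the values are stored as 16-byte blobs (SQLite integers are only 64 bits wide) and that the low byte picks the shard; the file names and table layout are illustrative only:

```python
import os
import sqlite3

N_BITS = 8                    # 2**8 = 256 shard databases, as suggested above
N_SHARDS = 1 << N_BITS

def open_shards(directory="shards"):
    """Open (or create) one single-table SQLite database per shard."""
    os.makedirs(directory, exist_ok=True)
    conns = []
    for i in range(N_SHARDS):
        conn = sqlite3.connect(os.path.join(directory, f"shard_{i:03d}.db"))
        # One table, one column, and that column is the primary key.
        conn.execute("CREATE TABLE IF NOT EXISTS ints (value BLOB PRIMARY KEY)")
        conns.append(conn)
    return conns

def add(conns, value):
    """Insert a 128-bit integer; return True if it already existed."""
    # SQLite integers are 64-bit, so store the value as a 16-byte blob.
    blob = value.to_bytes(16, "big")
    # The values are already uniform, so the low byte serves as the shard hash.
    conn = conns[value & (N_SHARDS - 1)]
    try:
        conn.execute("INSERT INTO ints (value) VALUES (?)", (blob,))
        conn.commit()  # in practice, batch many inserts per commit (see the edit below)
        return False
    except sqlite3.IntegrityError:
        return True
```

Usage is just conns = open_shards() once, then add(conns, n) per incoming value; the duplicate check falls out of the primary-key constraint rather than a separate lookup.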

EDIT: I've done some testing. I inserted 25 million entries in about two hours, but it gobbled up over 1 GB in the process. This was done by generating random numbers and distributing them to 32 subprocesses, each with one SQLite database under its control, committing once every 100,000 inserts. Insertions are currently chugging along at about 1350 Hz, far beyond the roughly 3 Hz your problem demands, but the entire database still fits in cache (I have 8 GB of RAM). I won't know the steady-state performance unless I get close to your current database size. At that point, every insertion will induce at least four disk-head movements (reading and writing the index and the table, probably more to drill down into the B+ tree), and then you'll know how much pain you're really in for.
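A rough sketch of what such a test driver could look like, under the assumptions stated above (32 worker processes, one SQLite database each, a commit every 100,000 inserts); this is not the actual benchmark code, and the file names and batch size are made up for illustration:

```python
import random
import sqlite3
from multiprocessing import Process, Queue

N_WORKERS = 32
BATCH = 100_000          # commit once per 100,000 inserts, as described above

def worker(shard_id, queue):
    conn = sqlite3.connect(f"bench_shard_{shard_id:02d}.db")
    conn.execute("CREATE TABLE IF NOT EXISTS ints (value BLOB PRIMARY KEY)")
    pending = 0
    while True:
        value = queue.get()
        if value is None:                # sentinel: flush and stop
            break
        try:
            conn.execute("INSERT INTO ints (value) VALUES (?)",
                         (value.to_bytes(16, "big"),))
        except sqlite3.IntegrityError:
            pass                         # duplicate, ignore
        pending += 1
        if pending >= BATCH:
            conn.commit()
            pending = 0
    conn.commit()

if __name__ == "__main__":
    queues = [Queue(maxsize=10_000) for _ in range(N_WORKERS)]
    procs = [Process(target=worker, args=(i, q)) for i, q in enumerate(queues)]
    for p in procs:
        p.start()
    for _ in range(1_000_000):                # a short run for illustration
        value = random.getrandbits(128)
        queues[value % N_WORKERS].put(value)  # route each value to its own DB
    for q in queues:
        q.put(None)
    for p in procs:
        p.join()
```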

I'm starting to think that this is a genuinely interesting problem that could be solved much more efficiently with a custom-built solution. However, I suspect it will take a reasonable amount of effort to significantly outperform a database engine.

High Performance Mark

Hash your hashes?