Alternative to a very large dictionary (~40 million keys)

I have a rather large dictionary with about 40 million keys, which I naively stored just by writing {key: value, key: value, ...} into a text file. I didn't consider the fact that I could never realistically access this data, because Python has an aversion to loading and evaluating a 1.44 GB text file as a dictionary.

I know I could use something like shelve to be able to access the data without reading all of it at once, but I'm not sure how I would even convert this text file to a shelve file without regenerating all the data (which I would prefer not to do). Are there any better alternatives for storing, accessing, and potentially later changing this much data? If not, how should I go about converting this monstrosity over to a format usable by shelve?

If it matters, the dictionary is of the form {(int, int, int, int): [[int, int], bool]}

There are 2 answers

2
dagnelies On

https://github.com/dagnelies/pysos

It works like a normal Python dict, but it is much more efficient than shelve on Windows and is cross-platform, whereas shelve's storage format differs depending on the OS.

To install:

pip install pysos

Usage:

import pysos
db = pysos.Dict('somefile')
db['hello'] = 'persistence!'

Just to give a ballpark figure, here is a mini benchmark (on my Windows laptop):

import time

import pysos

N = 100 * 1000  # number of key/value pairs to insert
t = time.time()  # start the timer before opening the store
db = pysos.Dict("test.db")
for i in range(N):
    db["key_" + str(i)] = {"some": "object_" + str(i)}
db.close()

print('PYSOS time:', time.time() - t)
# => PYSOS time: 3.424309253692627

The resulting file was about 3.5 MB.

So, in your case: at this rate a million key/value pairs take roughly half a minute to insert, so all 40 million would take about 25 minutes; even at a pessimistic minute per million it would stay under an hour. Of course, the machine's specs can influence that a lot. It's just a very rough estimate.
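As for getting the existing text file into such a store without evaluating it all at once, here is a minimal sketch. It assumes the file really is one dict literal of the form given in the question, written with Python's default formatting. The names `iter_entries` and `convert` are hypothetical, and `store` can be any dict-like object. Keys are written as strings such as "1,2,3,4" because shelve only accepts string keys; whether a particular store accepts tuple keys directly is something to check.

```python
import io
import re

# One "(a, b, c, d): [[x, y], True|False]" entry, as repr() would write it.
ENTRY = re.compile(
    r"\(\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*\)\s*:\s*"
    r"\[\s*\[\s*(-?\d+)\s*,\s*(-?\d+)\s*\]\s*,\s*(True|False)\s*\]"
)


def iter_entries(fileobj, chunk_size=1 << 20):
    """Yield (key, value) pairs while reading the file chunk by chunk."""
    buffer = ""
    while True:
        chunk = fileobj.read(chunk_size)
        buffer += chunk
        pos = 0
        for m in ENTRY.finditer(buffer):
            a, b, c, d, x, y = (int(g) for g in m.groups()[:6])
            yield (a, b, c, d), [[x, y], m.group(7) == "True"]
            pos = m.end()
        # Keep only the unparsed tail (possibly a partial entry).
        buffer = buffer[pos:]
        if not chunk:  # end of file
            break
    # Only trailing leftovers are checked; separators between entries are skipped.
    if buffer.strip(" \t\r\n,{}"):
        raise ValueError("unparsed trailing text: %r" % buffer[:50])


def convert(fileobj, store, chunk_size=1 << 20):
    """Copy every entry into a dict-like store, using string keys."""
    count = 0
    for key, value in iter_entries(fileobj, chunk_size):
        store[",".join(str(part) for part in key)] = value
        count += 1
    return count


if __name__ == "__main__":
    sample = io.StringIO("{(1, 2, 3, 4): [[5, 6], True], (7, 8, 9, 0): [[1, 2], False]}")
    target = {}  # stand-in for a persistent store
    print(convert(sample, target), target)
```

With the real file you would pass `open("data.txt")` (a placeholder name) and a persistent store instead of the plain dict, then close the store when done.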

1
Gonzalo Matheu On

Redis is an in-memory key-value store that can be used for this kind of problem.

There are several Python clients.

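The hmset operation lets you set multiple field-value pairs on a single hash in one call (newer Redis versions deprecate it in favor of hset, which also accepts multiple pairs); mset does the same for multiple top-level keys.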
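A rough sketch of a batched load into a single Redis hash follows. It assumes a client object exposing `hset(name, mapping=...)`, which redis-py provides in recent versions. The hash name "bigdict", the helper names, and the "1,2,3,4" key encoding are all made up for illustration, and values are stored as JSON strings because Redis hash fields hold flat strings.

```python
import json


def encode_key(key):
    """Turn a tuple such as (1, 2, 3, 4) into the string "1,2,3,4"."""
    return ",".join(str(part) for part in key)


def load_into_hash(client, name, entries, batch_size=10_000):
    """Write (key, value) pairs into one hash, batch_size fields per call."""
    batch = {}
    total = 0
    for key, value in entries:
        batch[encode_key(key)] = json.dumps(value)
        if len(batch) >= batch_size:
            client.hset(name, mapping=batch)  # one round trip per batch
            total += len(batch)
            batch = {}
    if batch:  # flush the final partial batch
        client.hset(name, mapping=batch)
        total += len(batch)
    return total
```

With redis-py this would be called as `load_into_hash(redis.Redis(), "bigdict", entries)`, and a single value read back with `json.loads(client.hget("bigdict", "1,2,3,4"))`. Keep in mind that Redis holds the whole dataset in RAM.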