Alternative to a very large dictionary (~40 million keys)

Question

2k views Asked by sToxic5 At 20 September 2017 at 01:56

I have a rather large dictionary with about 40 million keys, which I naively stored just by writing {key: value, key: value, ...} into a text file. I didn't consider the fact that I could never realistically access this data, because Python has an aversion to loading and evaluating a 1.44GB text file as a dictionary.

I know I could use something like shelve to be able to access the data without reading all of it at once, but I'm not sure how I would even convert this text file to a shelve file without regenerating all the data (which I would prefer not to do). Are there any better alternatives for storing, accessing, and potentially later changing this much data? If not, how should I go about converting this monstrosity over to a format usable by shelve?

If it matters, the dictionary is of the form {(int, int, int, int): [[int, int], bool]}

There are 2 answers

**dagnelies** · Answer 1 · 2020-11-25T09:25:18+00:00

https://github.com/dagnelies/pysos

It works like a normal Python dict, but has the advantage that it's much more efficient than shelve on Windows and is also cross-platform, unlike shelve, where the data storage differs based on the OS.

To install:

pip install pysos

Usage:

import pysos
db = pysos.Dict('somefile')
db['hello'] = 'persistence!'

Just to give a ballpark figure, here is a mini benchmark (on my Windows laptop):

import time
import pysos
t = time.time()
N = 100 * 1000
db = pysos.Dict("test.db")
for i in range(N):
    db["key_" + str(i)] = {"some": "object_" + str(i)}
db.close()

print('PYSOS time:', time.time() - t)
# => PYSOS time: 3.424309253692627

The resulting file was about 3.5 MB.

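So, in your case: the benchmark above works out to roughly 34 seconds per million inserts if it scales linearly. Even if you budget a full minute per million key/value pairs, inserting all 40 million would take somewhere between 25 and 40 minutes. Of course, the machine's specs can influence that a lot. It's just a very rough estimate.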
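The estimate above only covers insertion; the entries still have to come out of the existing text file without evaluating it in one go. Here is a minimal sketch of one way to do that, assuming the file is a single dict literal whose entries all look like (int, int, int, int): [[int, int], True/False]. The names ENTRY, iter_entries and convert are made up for illustration, and turning the tuple key into a "1,2,3,4" string is an assumption too (shelve needs string keys; whether another store needs them depends on the store).

```python
import re

# One entry of the form (int, int, int, int): [[int, int], True|False].
# Assumption: every entry in the file follows exactly this shape.
ENTRY = re.compile(
    r"\(\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*\)\s*:\s*"
    r"\[\s*\[\s*(-?\d+)\s*,\s*(-?\d+)\s*\]\s*,\s*(True|False)\s*\]"
)


def iter_entries(path, chunk_size=1 << 20):
    """Yield (key_tuple, value) pairs while reading the file in chunks."""
    buf = ""
    with open(path, "r") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            last_end = 0
            for m in ENTRY.finditer(buf):
                a, b, c, d, x, y, flag = m.groups()
                key = (int(a), int(b), int(c), int(d))
                yield key, [[int(x), int(y)], flag == "True"]
                last_end = m.end()
            # Keep only the unmatched tail: an entry cut off at the chunk
            # boundary cannot match until the rest of it has been read.
            buf = buf[last_end:]


def convert(path, store, chunk_size=1 << 20):
    """Copy every entry into a dict-like store under a string key."""
    count = 0
    for key, value in iter_entries(path, chunk_size):
        store[",".join(map(str, key))] = value
        count += 1
    return count
```

Here store can be anything that supports item assignment with string keys, for example a plain dict for a dry run or a shelve.open(...) handle. Only one chunk plus a partial entry is held in memory at a time.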

**Gonzalo Matheu** · Answer 2 · 2017-09-20T02:04:49+00:00

Redis is an in-memory key-value store that can be used for this kind of problem.

There are several Python clients.

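The HMSET operation allows you to insert multiple field-value pairs into a hash in a single call. (Since Redis 4.0, HSET accepts multiple field-value pairs as well, and HMSET is considered deprecated.)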
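For 40 million entries, one round trip per key would be slow, so batching matters. Below is a rough sketch of batched loading through a pipeline rather than HMSET, assuming a client object with the redis-py interface (pipeline(), set(), execute()). The function name load_into_redis, the "1:2:3:4" key format and the JSON-encoded values are all illustrative assumptions, not anything Redis requires.

```python
import json


def load_into_redis(client, pairs, batch_size=10_000):
    """Send (key_tuple, value) pairs to Redis in pipelined batches.

    `client` is assumed to behave like a redis-py client.
    """
    pipe = client.pipeline(transaction=False)
    pending = 0
    total = 0
    for key, value in pairs:
        # Assumed encoding: tuple key -> "1:2:3:4", value -> JSON string.
        pipe.set(":".join(map(str, key)), json.dumps(value))
        pending += 1
        total += 1
        if pending >= batch_size:
            pipe.execute()  # flush the queued commands in one round trip
            pending = 0
    if pending:
        pipe.execute()  # flush whatever is left over
    return total
```

With redis-py installed and a server running locally, client would be something like redis.Redis(). Keep in mind that Redis holds the whole dataset in RAM, so all 40 million entries need to fit in memory.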
