Fast key-value disk storage for Python


I'm wondering if there is a fast on-disk key-value store with Python bindings that supports millions of read/write calls to separate keys. My problem involves counting word co-occurrences in a very large corpus (Wikipedia) and continually updating the co-occurrence counts. This involves reading and writing ~300 million values 70 times, with 64-bit keys and 64-bit values.
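
For concreteness, each count is keyed by a pair of word IDs, and one way such a pair might be packed into a 64-bit key is the following (the exact encoding is just an illustration):

import struct

def pair_key(i: int, j: int) -> bytes:
    """Pack an (i, j) word-ID pair into an 8-byte key (upper triangle: i <= j)."""
    if i > j:
        i, j = j, i
    # two unsigned 32-bit ints -> one 64-bit key
    return struct.pack(">II", i, j)

def pair_value(count: int) -> bytes:
    """Encode a count as a 64-bit value."""
    return struct.pack(">Q", count)

# e.g. the pair (word 42, word 7) becomes the 8-byte key b'\x00\x00\x00\x07\x00\x00\x00*'
key = pair_key(42, 7)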

I can also represent my data as an upper-triangular sparse matrix with dimensions ~ 2M x 2M.

So far I have tried:

  • Redis (64 GB of RAM is not large enough)
  • TileDB SparseArray (no way to add to existing values)
  • SQLite (way too slow)
  • LMDB (batching the 300 million reads/writes into transactions takes multiple hours to execute; a sketch of this approach is below the list)
  • Zarr (coordinate-based updating is SUPER slow)
  • SciPy .npz (can't keep the matrices in memory for the addition step)
  • sparse COO with memory-mapped coords and data (RAM usage is massive when adding matrices)
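
What I mean by batching the LMDB updates is roughly this (a simplified sketch; the map size and batch handling are illustrative):

import struct
import lmdb

# open a large memory-mapped environment (map_size is illustrative)
env = lmdb.open("./cooc.lmdb", map_size=64 * 2**30)

def apply_batch(updates):
    """updates: iterable of ((i, j), delta) co-occurrence increments."""
    with env.begin(write=True) as txn:  # one write transaction per batch
        for (i, j), delta in updates:
            key = struct.pack(">II", i, j)
            old = txn.get(key)          # None if the pair hasn't been seen yet
            count = struct.unpack(">Q", old)[0] if old else 0
            txn.put(key, struct.pack(">Q", count + delta))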

Right now the only solution that works well enough is LMDB, but the runtime is ~12 days, which seems unreasonable given that I'm not processing that much data. Saving the sub-matrix (with ~300M values) to disk using .npz is almost instant.

Any ideas?

There are 3 answers

Congyu WANG:

You might want to check out my project.

pip install rocksdict

This is a fast on-disk key-value store based on RocksDB, and it can take any Python object as a value. I consider it reliable and easy to use. Its performance is on par with GDBM, but it is cross-platform, whereas GDBM is only available for Python on Linux.

https://github.com/Congyuwang/RocksDict

Below is a demo:

from rocksdict import Rdict, Options

path = "./test_dict"

# create a Rdict with default options at `path`
db = Rdict(path)

db[1.0] = 1
db[1] = 1.0
db["huge integer"] = 2343546543243564534233536434567543
db["good"] = True
db["bad"] = False
db["bytes"] = b"bytes"
db["this is a list"] = [1, 2, 3]
db["store a dict"] = {0: 1}

import numpy as np
db[b"numpy"] = np.array([1, 2, 3])

import pandas as pd
db["a table"] = pd.DataFrame({"a": [1, 2], "b": [2, 1]})

# close Rdict
db.close()

# reopen Rdict from disk
db = Rdict(path)
assert db[1.0] == 1
assert db[1] == 1.0
assert db["huge integer"] == 2343546543243564534233536434567543
assert db["good"] == True
assert db["bad"] == False
assert db["bytes"] == b"bytes"
assert db["this is a list"] == [1, 2, 3]
assert db["store a dict"] == {0: 1}
assert np.all(db[b"numpy"] == np.array([1, 2, 3]))
assert np.all(db["a table"] == pd.DataFrame({"a": [1, 2], "b": [2, 1]}))

# iterate through all elements
for k, v in db.items():
    print(f"{k} -> {v}")

# batch get:
print(db[["good", "bad", 1.0]])
# [True, False, 1]
 
# delete the database from disk
del db
Rdict.destroy(path)

Majid soorani:

PySpark is more useful here. The pair-RDD approach looks like this in Spark's Java API:

PairFunction<String, String, String> keyData =
  new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String x) {
      return new Tuple2<>(x.split(" ")[0], x);
    }
  };

JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);

Source: https://www.oreilly.com/library/view/learning-spark/9781449359034/ch04.html
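
The same pair-RDD idea in PySpark might look roughly like this (file path and session setup are just illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pairs").getOrCreate()
lines = spark.sparkContext.textFile("wikipedia.txt")  # path is illustrative

# key each line by its first word, analogous to mapToPair above
pairs = lines.map(lambda x: (x.split(" ")[0], x))

# aggregation can then be distributed, e.g. counting values per key:
counts = pairs.mapValues(lambda _: 1).reduceByKey(lambda a, b: a + b)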

pufferfish:

Have a look at Plyvel, which is a Python interface to LevelDB.

I used it successfully several years ago, and both projects appear to still be active. My own use case was storing hundreds of millions of key-value pairs, and I was more focused on read performance, but it looks optimized for writes as well.
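
A minimal sketch of how it might be used for this kind of counting workload (the key/value encoding here is just illustrative):

import struct
import plyvel

# open (or create) a LevelDB database on disk
db = plyvel.DB("./cooc.ldb", create_if_missing=True)

key = struct.pack(">II", 7, 42)            # a packed (i, j) word-ID pair
old = db.get(key)                          # returns None if the key is absent
count = struct.unpack(">Q", old)[0] if old else 0
db.put(key, struct.pack(">Q", count + 1))

# many writes can be grouped into a batch to reduce overhead
with db.write_batch() as wb:
    wb.put(struct.pack(">II", 1, 2), struct.pack(">Q", 5))

db.close()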