Shelve dictionary size is >100Gb for a 2Gb text file


I am creating a shelve file of sequences from a genomic FASTA file:

# Import necessary libraries
import shelve
from Bio import SeqIO

# Create dictionary of genomic sequences
genome = {}
with open("Mus_musculus.GRCm38.dna.primary_assembly.fa") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        genome[str(record.id)] = str(record.seq)

# Shelve genome sequences
myShelve = shelve.open("Mus_musculus.GRCm38.dna.primary_assembly.db")
myShelve.update(genome)
myShelve.close()

The FASTA file itself is 2.6 GB, but when I try to shelve it, a file of more than 100 GB is produced, and my computer throws out a number of complaints about being out of memory and the startup disk being full. This only seems to happen when I run it under OS X Yosemite; on Ubuntu it works as expected. Any suggestions why this is not working? I'm using Python 3.4.2.


There are 2 answers

Answer by hynekcer

Verify which interface dbm uses with import dbm; print(dbm.whichdb('your_file.db')). The file format used by shelve depends on the best binary package installed on your system and its interfaces. The newest is gdbm; dumb is a fallback solution used if no binary package is found, and ndbm is somewhere in between.
https://docs.python.org/3/library/shelve.html
https://docs.python.org/3/library/dbm.html
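
For example, a minimal check might look like this (the file name is taken from the question; depending on the backend, the file on disk may carry an extra suffix, so adjust the path to what actually exists):

import dbm

# Report which backend created the shelve file:
# e.g. 'dbm.gnu', 'dbm.ndbm', or 'dbm.dumb'
print(dbm.whichdb("Mus_musculus.GRCm38.dna.primary_assembly.db"))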

It is not a good idea to hold all of the data in memory, because you then lose that memory for the filesystem cache. Updating in smaller blocks is better; I don't even see a slowdown when items are updated one by one.

import shelve
from Bio import SeqIO

myShelve = shelve.open("Mus_musculus.GRCm38.dna.primary_assembly.db")
with open("Mus_musculus.GRCm38.dna.primary_assembly.fa") as handle:
    # Write each record to the shelve as soon as it is parsed,
    # so only one sequence needs to be held in memory at a time.
    for record in SeqIO.parse(handle, "fasta"):
        myShelve.update([(str(record.id), str(record.seq))])
myShelve.close()
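
As a side note (not part of the original answer): since Python 3.4, a shelve.Shelf can be used as a context manager, which guarantees that close() is called even if parsing raises partway through. A sketch using the same file names as above:

import shelve
from Bio import SeqIO

# The "with" block closes the shelf even on an exception,
# which also helps avoid the fragmentation issue described below.
with shelve.open("Mus_musculus.GRCm38.dna.primary_assembly.db") as myShelve:
    with open("Mus_musculus.GRCm38.dna.primary_assembly.fa") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            myShelve[str(record.id)] = str(record.seq)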

It is known that dbm databases become fragmented if the application crashes after updates without calling the database's close() method. I think this was your case. You probably don't have any important data in the big file yet, but in the future you can defragment a gdbm database with its reorganize() method.

Answer by Jpsy

I had the very same problem: on a macOS system, a shelve with about 4 megabytes of data grew to the enormous size of 29 gigabytes on disk! This apparently happened because I updated the same key-value pairs in the shelve over and over again.

As my shelve was based on GNU dbm, I was able to use @hynekcer's hint about reorganizing. Here is the code that brought my shelve file back to normal size within seconds:

import dbm

# Open the existing shelve file for writing and compact it in place.
# reorganize() is only available on the GNU dbm backend.
db = dbm.open(shelfFileName, 'w')
db.reorganize()
db.close()
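
To confirm the effect, you can compare the on-disk size before and after (a quick sanity check, not part of the original answer; shelfFileName is the same placeholder as above):

import os

# Report the shelve file's size in gigabytes
print(os.path.getsize(shelfFileName) / 1e9, "GB")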

I am not sure whether this technique will work for other (non-GNU) dbms as well. To test your dbm system, remember the code shown by @hynekcer:

import dbm
print(dbm.whichdb(shelfFileName))

If GNU dbm is used on your system, this should output 'dbm.gnu' (the new name for the older gdbm).
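
Putting the two steps together, a defensive sketch might look like this (assuming shelfFileName as above; reorganize() is only attempted when the GNU backend is detected):

import dbm

# Only call reorganize() when the shelve is backed by GNU dbm,
# since other backends do not provide that method.
if dbm.whichdb(shelfFileName) == "dbm.gnu":
    db = dbm.open(shelfFileName, "w")
    db.reorganize()  # compact the file in place
    db.close()
else:
    print("Not a GNU dbm file; reorganize() is not available.")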