I am creating a shelve file of sequences from a genomic FASTA file:
# Import necessary libraries
import shelve
from Bio import SeqIO
# Create dictionary of genomic sequences
genome = {}
with open("Mus_musculus.GRCm38.dna.primary_assembly.fa") as handle:
for record in SeqIO.parse(handle, "fasta"):
    genome[str(record.id)] = str(record.seq)
# Shelve genome sequences
myShelve = shelve.open("Mus_musculus.GRCm38.dna.primary_assembly.db")
myShelve.update(genome)
myShelve.close()
The file itself is 2.6 GB, but when I try to shelve it, a file of more than 100 GB is produced, and my computer throws a number of complaints about being out of memory and the startup disk being full. This only seems to happen when I run the script under OS X Yosemite; on Ubuntu it works as expected. Any suggestions as to why this is not working? I'm using Python 3.4.2.
Verify which dbm interface is being used by:
import dbm; print(dbm.whichdb('your_file.db'))
The file format used by shelve depends on the best installed binary package available on your system and its interfaces. The newest is gdbm, dumb is a fallback solution if no binary is found, and ndbm is something in between.
https://docs.python.org/3/library/shelve.html
https://docs.python.org/3/library/dbm.html
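As a quick check (a minimal sketch; "test_shelf" is just a placeholder name), you can see which backends are importable on a given machine and which format shelve actually picks for a new file:

# Probe which dbm backends are available on this system.
import importlib
import shelve
import dbm

for name in ("dbm.gnu", "dbm.ndbm", "dbm.dumb"):
    try:
        importlib.import_module(name)
        print(name, "available")
    except ImportError:
        print(name, "NOT available")

# Create a tiny shelf and ask dbm which format it got.
with shelve.open("test_shelf") as s:
    s["key"] = "value"
print(dbm.whichdb("test_shelf"))  # e.g. 'dbm.gnu' on Linux, often 'dbm.ndbm' on OS X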
It is not a good idea to hold all the data in memory, because that memory is then lost to the filesystem cache. Updating in smaller blocks is better; I don't even see a slowdown if items are updated one by one.
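For example, a minimal sketch of your script rewritten to write each record straight into the shelf as it is parsed (file names taken from your question) could look like this:

# Write each sequence into the shelf one at a time,
# instead of building a 2.6 GB dict in memory first.
import shelve
from Bio import SeqIO

with open("Mus_musculus.GRCm38.dna.primary_assembly.fa") as handle, \
        shelve.open("Mus_musculus.GRCm38.dna.primary_assembly.db") as myShelve:
    for record in SeqIO.parse(handle, "fasta"):
        myShelve[record.id] = str(record.seq)
# Both the handle and the shelf are closed here, so the database is written out cleanly.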
It is known that dbm databases become fragmented if the application crashes after updates without calling the database's close(). I think this was your case. You probably have no important data in the big file yet, but in the future you can defragment a database with gdbm.reorganize().
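A sketch of that call, assuming the shelf was created with the gdbm backend (in Python 3, gdbm lives in dbm.gnu) and using the file name from your question:

# Reclaim space in a gdbm database after deletions/updates.
import dbm.gnu

db = dbm.gnu.open("Mus_musculus.GRCm38.dna.primary_assembly.db", "w")
db.reorganize()  # compacts the file in place
db.close()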