I am using Python's bz2 module to generate (and compress) a large JSONL file (17 GB when bzip2-compressed).
However, when I later try to decompress it with pbzip2, it only seems to use one CPU core, which is quite slow.
When I compress the file with pbzip2 itself, decompression can leverage multiple cores. Is there a way to compress from within Python in a pbzip2-compatible format?
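
    import bz2
    import sys
    import traceback
    from Queue import Empty  # Python 2; on Python 3 this is `from queue import Empty`
    #...
    compressor = bz2.BZ2Compressor(9)
    f = open(path, 'ab')  # binary mode: the compressor produces bytes
    try:
        while True:
            # wait up to 60 seconds for the next message
            m = queue.get(True, 1*60)
            f.write(compressor.compress(m + "\n"))
    except Empty:
        # no message arrived within the timeout: stop writing
        pass
    except Exception:
        traceback.print_exc()
    finally:
        sys.stderr.write("flushing")
        f.write(compressor.flush())
        f.close()
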
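A `pbzip2` stream is nothing more than the concatenation of multiple `bzip2` streams. For example, in the shell, concatenating two `.bz2` files with `cat` produces a file that `bzip2 -d` decompresses to the concatenation of the two originals.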
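The same property can be checked from Python itself. This is a minimal sketch, assuming Python 3.3 or later (where `bz2.decompress` accepts multi-stream input); the sample byte strings are made up for illustration.

```python
import bz2

# Two independent bzip2 streams, each compressed on its own.
part_a = bz2.compress(b"first line\n")
part_b = bz2.compress(b"second line\n")

# Plain byte concatenation, the same thing `cat a.bz2 b.bz2` would produce.
combined = part_a + part_b

# Decompressing the concatenation yields both payloads back to back.
restored = bz2.decompress(combined)
```

Each part begins with its own bzip2 stream header, which is what lets a parallel decompressor split the work.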
I've never used Python's `bz2` module, but it should be easy to close/reopen a stream in `'a'`ppend mode, every so-many bytes, to get the same result. Note that if `BZ2File` is constructed from an existing file-like object, closing the `BZ2File` will not close the underlying stream (which is what you want here).

I haven't measured how many bytes is optimal for chunking, but I would guess every 1-20 megabytes. It definitely needs to be larger than the bzip2 block size (900k), though.
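The chunking idea could be sketched roughly as follows. Instead of closing and reopening a `BZ2File`, this variant swaps in a fresh `bz2.BZ2Compressor` (the class the question already uses) whenever a size threshold is reached, which also yields one independent stream per chunk. It assumes Python 3; `ChunkedBz2Writer` and `CHUNK_BYTES` are hypothetical names, and the 8 MB threshold is an unmeasured guess.

```python
import bz2

CHUNK_BYTES = 8 * 1024 * 1024  # assumed threshold, not a measured optimum


class ChunkedBz2Writer:
    """Write data as a series of independent, concatenated bzip2 streams."""

    def __init__(self, fileobj, chunk_bytes=CHUNK_BYTES):
        self._f = fileobj  # binary file-like object; the caller closes it
        self._chunk_bytes = chunk_bytes
        self._compressor = bz2.BZ2Compressor(9)
        self._pending = 0  # uncompressed bytes fed into the current stream

    def write(self, data):
        self._f.write(self._compressor.compress(data))
        self._pending += len(data)
        if self._pending >= self._chunk_bytes:
            self._finish_stream()

    def _finish_stream(self):
        # flush() terminates the current stream; a compressor cannot be
        # reused afterwards, so start a new one for the next chunk.
        self._f.write(self._compressor.flush())
        self._compressor = bz2.BZ2Compressor(9)
        self._pending = 0

    def close(self):
        # Finish the last, partially filled stream (if any).
        if self._pending:
            self._f.write(self._compressor.flush())
            self._pending = 0
```

In the question's loop, `f.write(compressor.compress(...))` would become `writer.write(...)`, with `writer.close()` called in the `finally` block before `f.close()`.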
Note also that if you record the compressed and uncompressed offsets of each chunk, you can do fairly efficient random access. This is how the `dictzip` program works, though that is based on `gzip`.
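A rough sketch of such an offset index, assuming Python 3 and a seekable binary file object. `write_indexed` and `read_at` are hypothetical helper names, and the index layout is invented for illustration (it is not dictzip's format).

```python
import bz2
from bisect import bisect_right


def write_indexed(fileobj, chunks):
    """Write each chunk as its own bzip2 stream and return an offset index.

    Each entry is (uncompressed_offset, compressed_offset, compressed_length).
    """
    index = []
    uncompressed_pos = 0
    for chunk in chunks:
        blob = bz2.compress(chunk)
        index.append((uncompressed_pos, fileobj.tell(), len(blob)))
        fileobj.write(blob)
        uncompressed_pos += len(chunk)
    return index


def read_at(fileobj, index, offset, size):
    """Return up to `size` uncompressed bytes starting at `offset`."""
    starts = [entry[0] for entry in index]
    i = bisect_right(starts, offset) - 1  # chunk containing `offset`
    if i < 0:
        return b""  # empty index
    skip = offset - starts[i]
    out = bytearray()
    # Decompress only the chunks that overlap the requested range.
    while i < len(index) and len(out) < skip + size:
        _, c_off, c_len = index[i]
        fileobj.seek(c_off)
        out += bz2.decompress(fileobj.read(c_len))
        i += 1
    return bytes(out[skip:skip + size])
```

Only the chunks overlapping the requested range are decompressed, so the cost of a read is bounded by the chunk size rather than the file size.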