How can I decompress a *.bz2 file in memory with Python? The bz2 file was created from a CSV file.
I use the code below to decompress it in memory. It works, but the output contains some dirty data, such as the filename of the CSV file and its author name. Is there a better way to handle it?
#!/usr/bin/python
# -*- coding: utf-8 -*-
import StringIO
import bz2
with open("/app/tmp/res_test.tar.bz2", "rb") as f:
    content = f.read()
compressedFile = StringIO.StringIO(content)
decompressedFile = bz2.decompress(compressedFile.buf)
compressedFile.seek(0)
with open("/app/tmp/decompress_test", 'w') as outfile:
    outfile.write(decompressedFile)
I found this question, which deals with gzip, but my data is in bz2 format. I tried to follow the instructions there, but it seems that bz2 cannot be handled in the same way.
Edit:
Both the answer from @metatoaster and the code above bring extra dirty data into the final decompressed file. For example, my original data is attached below, in CSV format, in a file named res_test.csv:
Then I cd into the directory containing the file and compress it with tar -cjf res_test.tar.bz2 res_test.csv
This gives the compressed file res_test.tar.bz2, which simulates the bz2 data that I will get from the internet. I wish to decompress it in memory without caching it to disk first, but what I get is the data below, which contains too much dirty data:
The data is still there, but it is submerged in noise. Is it possible to decompress it into clean data identical to the original, instead of decompressing it and then extracting the real data from the noise?
For generic bz2 decompression, the BZ2File class may be used: open the file with it and call read(), and the result is the decompressed contents of the file.

However, given that this is a tar file (an archive file that is normally extracted to disk as a directory of files), the tarfile module could be used instead, and it has extended mode flags for handling bz2. Assuming the target file contains a res_test.csv, the archive can be opened with tarfile.open in 'r:bz2' mode and the member read through extractfile.

The 'r:bz2' flag opens the tar archive in a way that makes it possible to seek backwards. This is important because the alternative, 'r|bz2', treats the archive as a forward-only stream, which makes it impractical to read the file objects returned by extractfile. Calling extractfile('res_test.csv') and then read() on the result returns the contents of 'res_test.csv' from the archive file as a string (bytes in Python 3).

The transparent open mode ('r:*') is typically recommended, however, so that no failure is encountered if the input tar file is compressed using gzip instead.

Naturally, the open function of the tarfile module also accepts a fileobj argument, so it may be used on arbitrary stream objects. If the file was already opened using BZ2File, that object can be passed to tarfile.open in this way.
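
A minimal sketch of the approach described above, assuming Python 3 (where extractfile(...).read() returns bytes). The helper names make_sample_archive, read_member and read_member_via_bz2file, as well as the sample CSV rows, are made up for illustration, and the archive is built in memory only to keep the example self-contained. The sketch also shows where the "dirty data" comes from: the output of plain bz2.decompress is still a tar stream, whose header block stores the member name and owner fields and whose data is padded with NUL bytes.

```python
import bz2
import io
import tarfile


def make_sample_archive(name, payload):
    """Build a .tar.bz2 archive in memory with a single member (demo data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tf:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_member(archive_bytes, member_name):
    """Return one member's bytes from a tar archive held entirely in memory."""
    # 'r:*' lets tarfile detect bz2, gzip, or no compression by itself.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tf:
        member = tf.extractfile(member_name)
        if member is None:  # returned for entries that are not regular files
            raise ValueError("%r is not a regular file" % member_name)
        return member.read()


def read_member_via_bz2file(archive_bytes, member_name):
    """Same result, but with the stream already opened through BZ2File."""
    # BZ2File undoes the compression; tarfile then parses the plain tar
    # stream, hence the uncompressed mode 'r:'.
    with bz2.BZ2File(io.BytesIO(archive_bytes)) as plain_tar:
        with tarfile.open(fileobj=plain_tar, mode="r:") as tf:
            return tf.extractfile(member_name).read()


# Hypothetical sample rows standing in for res_test.csv.
csv_bytes = b"id,value\n1,10\n2,20\n"
archive = make_sample_archive("res_test.csv", csv_bytes)

# Plain bz2 decompression: still a tar stream (header + data + padding).
raw = bz2.decompress(archive)

# tarfile extraction: only the member's own contents.
extracted = read_member(archive, "res_test.csv")
```

For data fetched from the network, the downloaded bytes would take the place of archive, and nothing needs to be written to disk.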