Download bz2, read compressed files in memory (avoid memory overflow)


As the title says, I'm downloading a bz2 file that contains a folder with a lot of text files.

My first version decompressed everything in memory. Although the archive is only 90 MB, it holds 60 files of 750 MB each once uncompressed, so the computer goes boom: it obviously can't handle around 40 GB of RAM.

So the problem is that the files are too big to keep in memory all at once. I'm now using the code below, which works but is too slow:

response = requests.get('https://fooweb.com/barfile.bz2')

# save file into disk:
compress_filepath = '{0}/files/sources/{1}'.format(zsets.BASE_DIR, check_time)
with open(compress_filepath, 'wb') as local_file:
    local_file.write(response.content)

#We extract the files into folder 
extract_folder = compress_filepath + '_ext'
with tarfile.open(compress_filepath, "r:bz2") as tar:
    tar.extractall(extract_folder)

# We process one file at a time:
for filename in os.listdir(extract_folder):
    filepath = '{0}/{1}'.format(extract_folder,filename)
    file = open(filepath, 'r').readlines()
    
    for line in file:
        some_processing(line)

Is there a way to do this without dumping everything to disk, decompressing and reading only one file from the .bz2 at a time?

Thank you very much in advance for your time. I hope somebody knows how to help me with this.


There are 2 answers

1
Mark Adler On
#!/usr/bin/python3
import sys
import requests
import tarfile
got = requests.get(sys.argv[1], stream=True)
with tarfile.open(fileobj=got.raw, mode='r|*') as tar:
    for info in tar:
        if info.isreg():
            ent = tar.extractfile(info)
            # now process ent as a file, however you like
            print(info.name, len(ent.read()))
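
Note that `ent.read()` above loads a whole member at once, which could be up to 750 MB per file in the question. Below is a minimal sketch of processing each member line by line instead. It assumes the text files are UTF-8 (the question does not say), and `iter_member_lines`, `process_url` and `handle_line` are hypothetical names used only for illustration; they are not part of tarfile or requests.

```python
import tarfile


def iter_member_lines(fileobj, encoding="utf-8"):
    """Yield (member_name, line) pairs from a tar stream, one member at a time."""
    # 'r|*' reads the archive as a forward-only stream and detects the compression.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for info in tar:
            if not info.isreg():
                continue  # skip the folder entry and other non-regular members
            member = tar.extractfile(info)
            # Iterating the binary file object yields one line at a time,
            # so only a small buffer of the member is held in memory.
            for raw_line in member:
                yield info.name, raw_line.decode(encoding)  # assumed encoding


def process_url(url, handle_line):
    """Stream the archive at `url` and pass each line to `handle_line`."""
    import requests  # imported here so the parsing helper works without it

    with requests.get(url, stream=True) as got:
        got.raise_for_status()
        for name, line in iter_member_lines(got.raw):
            handle_line(name, line)
```

Because the stream is forward-only, each member has to be consumed before moving on to the next one; going back to an earlier member is not possible in this mode.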
0
Marcos Federico Mandrille On

I did it this way:

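import io
import tarfile

import requests

response = requests.get(my_url_to_file)
# Only the compressed archive is kept in memory.
memfile = io.BytesIO(response.content)
filecount = 0
# We read the files in memory, one by one:
with tarfile.open(fileobj=memfile, mode="r:bz2") as tar:
    for member in tar:
        if not member.isfile():
            continue  # skip the folder entry, extractfile() returns None for it
        filecount += 1
        # extractfile() returns a binary file object, not a path, so iterate it directly
        for line in tar.extractfile(member):
            process_line(line.decode("utf-8"))  # assumes UTF-8 text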
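
One possible way to try this approach without downloading anything is to build a small tar.bz2 in memory and run the same member-by-member loop over it. The sketch below does that; the archive layout (a `data` folder with two tiny text files) and the helper names `make_sample_archive` and `count_lines_per_member` are made up for illustration.

```python
import io
import tarfile


def make_sample_archive():
    """Build a small tar.bz2 in memory: one folder holding two text files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tar:
        folder = tarfile.TarInfo("data")
        folder.type = tarfile.DIRTYPE
        tar.addfile(folder)
        for name, text in [("data/a.txt", "one\ntwo\n"), ("data/b.txt", "three\n")]:
            payload = text.encode("utf-8")
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)  # rewind so the archive can be read back
    return buf


def count_lines_per_member(memfile):
    """Read the archive one member at a time and count the lines in each file."""
    counts = {}
    with tarfile.open(fileobj=memfile, mode="r:bz2") as tar:
        for member in tar:
            if not member.isfile():
                continue  # extractfile() returns None for the folder entry
            counts[member.name] = sum(1 for _ in tar.extractfile(member))
    return counts


if __name__ == "__main__":
    print(count_lines_per_member(make_sample_archive()))
```

Swapping the line count for a call to `process_line` gives the same loop as above, run against sample data instead of the real download.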