Decompressing large streams with Python tarfile


I have a large .tar.xz file that I am downloading with Python requests, and it needs to be decompressed before being written to disk (due to limited disk space). I have a solution that works for smaller files, but larger files hang indefinitely.

import io
import requests
import tarfile
session = requests.Session()
response = session.get(url, stream=True)

compressed_data = io.BytesIO(response.content)
tar = tarfile.open(mode='r|*', fileobj=compressed_data, bufsize=16384)
tar.extractall(path='/path/')

It hangs at io.BytesIO for larger files.

Is there a way to pass the stream to fileobj without reading the entire stream? Or is there a better approach to this?


There are 2 answers

Musabbir Arrafi

You should use the lzma library to decompress .xz files: download the file in chunks (to stay memory efficient), decompress each chunk, then write it to disk. Here's a script I use on my server to download a large tar.xz once a week; the files are typically around 6 GB. This should work for you too.

import requests
import lzma
import tarfile
import os
import tempfile

url = 'your tar.xz url'

with requests.get(url, stream=True) as response:
    response.raise_for_status()

    # Initialize LZMA decompressor
    decompressor = lzma.LZMADecompressor()

    # Create a temporary file to store the decompressed data
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        for chunk in response.iter_content(chunk_size=32 * 1024):
            data = decompressor.decompress(chunk)
            tmp_file.write(data)

        # Get the name of the temporary file
        tmp_file_name = tmp_file.name

# Now extract from the temporary file
with tarfile.open(tmp_file_name, mode="r") as tar:
    tar.extractall(path="/home/arrafi/")

# Clean up the temporary file
os.remove(tmp_file_name)

chunk_size=32 * 1024: adjust the chunk size according to your specs.

Now, if you insist on using io, modify your code to download and decompress in chunks. Your code hangs because it tries to download everything at once, which exhausts memory. To handle large files, the download has to happen in chunks to stay memory efficient.
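
For example, here is a minimal sketch (my addition, not part of the original answer) that avoids both the full in-memory buffer and the temporary file by letting tarfile stream straight from the response. It assumes the server sends the raw .tar.xz bytes without extra content encoding; the url and extraction path are placeholders.

import requests
import tarfile

url = 'your tar.xz url'

with requests.get(url, stream=True) as response:
    response.raise_for_status()
    # 'r|xz' opens the tar in streaming mode: tarfile reads the raw
    # response in small blocks and decompresses on the fly, so the
    # archive is never held fully in memory or written out whole.
    with tarfile.open(fileobj=response.raw, mode='r|xz') as tar:
        tar.extractall(path='/path/')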

cards

Use response.iter_content to stream the tar.xz download in chunks, which are incrementally decompressed in memory by LZMADecompressor and written to a buffer. Once the download has finished, the tar archive contained in the buffer is extracted.

from io import BytesIO
from lzma import LZMADecompressor, FORMAT_XZ
import tarfile
import requests
import contextlib
import os


url = '' #                                 url of the data
chunk_size = 2**14 #                       just an example
path_tar_archive_dir = 'downloaded_tar' #  location of the extraction of the archive


with requests.Session() as session:
    response = session.get(url, stream=True)

        # create buffer to store streamed data
    with BytesIO() as bw:
        # xz-decompressor
        d = LZMADecompressor(format=FORMAT_XZ) # works also without arguments (the default auto-detects the format)

        # read incoming data
        for chunk in response.iter_content(chunk_size=chunk_size):
            # in-memory automatic incremental decompression
            data = d.decompress(chunk)
            if data:
                bw.write(data)
            
        if not d.eof:
            raise Exception('EOF of streaming data not reached')
        
        print('[OK] Download and xz-decompression of stream data')
        # set stream position at the start
        bw.seek(0)
        
        # temporarily silence tar output by redirecting stdout
        with open(os.devnull, 'w') as f, contextlib.redirect_stdout(f):

            # open the archive contained in the buffer
            with tarfile.open(fileobj=bw, mode='r') as tar:
                # extract the archive into the given directory
                tar.extractall(path_tar_archive_dir, None, numeric_owner=False)
            
        print('[OK] Extraction tar-archive')

Notice that tar.extractall supports a new keyword-only parameter, filter, added in Python 3.12 and backported to maintenance releases such as 3.9.17.
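
For example, on a Python version that has it you can opt in to the safer extraction behaviour explicitly. A minimal sketch, reusing the bw buffer and path from the code above:

# the 'data' filter rejects absolute paths, links escaping the destination, etc.
with tarfile.open(fileobj=bw, mode='r') as tar:
    tar.extractall(path_tar_archive_dir, filter='data')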

NOTE on performance: tar sends all its output to stdout, and by redirecting it to devnull the performance of the extraction will dramatically increase!


I set up a Flask server to test the code on localhost. I used a 34 MB tar.xz archive, and it took a while to fully download. Without the performance trick it takes a lot of RAM, in my case up to 200 MB. With the redirection to devnull, instead, its execution is barely noticeable in terms of time and RAM.
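
If you want to check the memory usage of a run yourself, here is a small illustrative sketch (my addition, not part of the original test; Unix-only, using the standard resource module):

import resource

# peak resident set size of the current process so far
# (reported in kilobytes on Linux, in bytes on macOS)
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f'[INFO] peak memory: {peak / 1024:.1f} MB (Linux)')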

Here is my test server.py (just for localhost tests!)

"""
# start the server
$ flask --app server run --debug
"""
from flask import Flask
from flask import send_from_directory
import os


app = Flask(__name__)


@app.route('/archive/', methods=['GET', 'POST'])
def archive():
    # create route http://127.0.0.1:5000/archive/
    
    abs_path_to_archive = '/path/to/your/archive.tar.xz' # <- here add the path!
    dir_path, basename = os.path.split(abs_path_to_archive)

    return send_from_directory(dir_path, basename, as_attachment=False)

Then start the server in its own terminal

$ flask --app server run --debug

and, in another terminal, run the above program to download the archive with url = "http://127.0.0.1:5000/archive/".