I have a large .tar.xz file that I am downloading with Python requests, and it needs to be decompressed before being written to disk (due to limited disk space). I have a solution that works for smaller files, but larger files hang indefinitely.
import io
import requests
import tarfile
session = requests.Session()
response = session.get(url, stream=True)
compressed_data = io.BytesIO(response.content)
tar = tarfile.open(mode='r|*', fileobj=compressed_data, bufsize=16384)
tar.extractall(path='/path/')
It hangs at the io.BytesIO line for larger files.
Is there a way to pass the stream to fileobj without reading the entire stream first, or is there a better approach to this?
You should use the lzma library to decompress .xz files. Download the file in chunks (to be memory efficient), decompress each chunk, then write it to disk. Here's a script like the one I use on my server to download a large tar.xz once a week; the file sizes are typically around 6 GB, so this should work for you too. chunk_size = 32 * 1024 is the default below; modify the chunk size according to your specs.
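A minimal sketch of that approach (the url value and the output path below are placeholders, not from your setup):

import lzma
import requests

url = 'https://example.com/archive.tar.xz'  # placeholder URL
chunk_size = 32 * 1024  # 32 KiB per chunk; adjust to your specs

# LZMADecompressor consumes .xz input incrementally, so only one chunk
# (plus internal decompressor state) is held in memory at any time.
decompressor = lzma.LZMADecompressor()
with requests.get(url, stream=True) as response:
    response.raise_for_status()
    with open('/path/archive.tar', 'wb') as out_file:
        for chunk in response.iter_content(chunk_size=chunk_size):
            # Decompress the downloaded chunk and write the plain tar bytes to disk.
            out_file.write(decompressor.decompress(chunk))

Note that this writes the decompressed .tar to disk; if you need the extracted contents instead, feed the stream to tarfile as shown further down.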
Now, if you insist on using io, modify your code to download and decompress in chunks. Your code hangs because it tries to download the entire file at once and runs out of memory; large files have to be downloaded in chunks to stay memory efficient.
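In fact, you can drop the io.BytesIO buffer entirely: tarfile's streaming modes ('r|*' or 'r|xz') read sequentially from any file-like object, and response.raw is one. A sketch of that modification, assuming the server sends the archive bytes without an extra Content-Encoding so that response.raw yields the raw .xz stream:

import requests
import tarfile

session = requests.Session()
response = session.get(url, stream=True)  # url as in your original code
response.raise_for_status()

# In streaming ('|') mode, tarfile reads and decompresses on the fly,
# so the archive is never fully held in memory or on disk.
with tarfile.open(fileobj=response.raw, mode='r|xz', bufsize=16384) as tar:
    tar.extractall(path='/path/')

This also answers your first question: a stream can be passed as fileobj, as long as you use a pipe mode ('r|...') rather than a random-access mode ('r:...'), since the latter needs to seek.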