First, let's generate a compressed tar archive:
from io import BytesIO
from tarfile import TarInfo
import tarfile

with tarfile.open('foo.tgz', mode='w:gz') as archive:
    for file_name in range(1000):
        file_info = TarInfo(str(file_name))
        file_info.size = 100_000
        archive.addfile(file_info, fileobj=BytesIO(b'a' * 100_000))
Now, if I read the archive contents in natural order:
import tarfile

with tarfile.open('foo.tgz') as archive:
    for file_name in archive.getnames():
        archive.extractfile(file_name).read()
and measure the execution time using the time command, I get less than 1 second on my PC:
real 0m0.591s
user 0m0.560s
sys 0m0.011s
But if I read the archive contents in reverse order:
import tarfile

with tarfile.open('foo.tgz') as archive:
    for file_name in reversed(archive.getnames()):
        archive.extractfile(file_name).read()
the execution time is now around 120 seconds:
real 2m3.050s
user 2m0.910s
sys 0m0.059s
Why is that? Is there a bug in my code, or is this a feature of tar? Is it documented somewhere?
A
A tar archive is strictly sequential, and because yours is gzip-compressed, moving backwards means rewinding to the start and decompressing everything up to the requested member again. You end up reading the beginning of the file 1000 times, rewinding in between, reading the second member 999 times, and so on.
Remember, the "tape archive" format was designed at a time when unidirectional tape reels on big spindles were the hardware in use. Having an index would only have wasted space on the tape, as you would literally have to read every byte between where you are and where you want to seek to on the tape anyway.
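If you need the members in some other order, one option is to make a single forward pass and keep the contents in memory. The following is only a minimal sketch of that idea: `build_archive` and `read_all_members` are hypothetical helper names, the archive is built in memory and kept small for illustration, and it assumes the extracted data fits in RAM.

```python
import tarfile
from io import BytesIO


def build_archive(count=5, size=1000):
    """Build a small gzip-compressed tar archive in memory (illustrative helper)."""
    buffer = BytesIO()
    with tarfile.open(fileobj=buffer, mode='w:gz') as archive:
        for index in range(count):
            info = tarfile.TarInfo(str(index))
            info.size = size
            archive.addfile(info, fileobj=BytesIO(b'a' * size))
    buffer.seek(0)
    return buffer


def read_all_members(fileobj):
    """Read every regular file in one forward pass, keyed by member name."""
    contents = {}
    with tarfile.open(fileobj=fileobj, mode='r:gz') as archive:
        for member in archive:  # iterates in the order members are stored
            if member.isfile():
                contents[member.name] = archive.extractfile(member).read()
    return contents


contents = read_all_members(build_archive())

# The data is now in memory, so any access order is cheap.
for name in reversed(list(contents)):
    data = contents[name]
```

Iterating over the `TarFile` object itself visits members in stored order, so the compressed stream is only ever read forwards.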
In contrast, modern archive formats like .zip are designed for use on properly seekable devices, and typically contain an index which lets you quickly move to the position where a specific archive member can be found.
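For comparison, here is a rough sketch of the same reverse-order read using Python's `zipfile` module. The archive is built in memory with only five small members (an assumption made to keep the example short), so it illustrates the access pattern rather than reproducing the timings above.

```python
import zipfile
from io import BytesIO

# Build a small zip archive in memory; each member is compressed on its own.
buffer = BytesIO()
with zipfile.ZipFile(buffer, mode='w', compression=zipfile.ZIP_DEFLATED) as archive:
    for index in range(5):
        archive.writestr(str(index), b'a' * 1000)

buffer.seek(0)

# Read the members in reverse order. The central directory tells zipfile
# where each member starts, so it can seek straight to it.
sizes = {}
with zipfile.ZipFile(buffer) as archive:
    for name in reversed(archive.namelist()):
        sizes[name] = len(archive.read(name))
```

Because each member is located through the index and decompressed independently, the order in which you read them should not matter much.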