Getting a single file from a tar archive using the tarfile library in Python

I am trying to grab a single file from a tar archive. Using the tarfile library, I can already pick out the members with the right extension, with code modeled on the example in the tarfile documentation:

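import os

def xml_member_files(self, members):
    # Yield only the archive members whose name ends in .xml
    for tarinfo in members:
        if os.path.splitext(tarinfo.name)[1] == ".xml":
            yield tarinfo


# Elsewhere in the same class, with tar being an already opened TarFile:
member_file = self.xml_member_files(tar)
for m in member_file:
    print(m.name)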

This is great and the output is:

RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/lutBeta.xml
RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/lutGamma.xml
RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/lutSigma.xml
RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/product.xml

If I look for just product.xml, though, it doesn't work. I tried this:

    ti = tar.getmember('product.xml')
    print(ti.name)

It doesn't find product.xml, I am guessing because of the path information in front of the file name. I have no idea how to retrieve just that path so I can get at my product.xml file (it feels like I am doing things the hard way anyway). How do I figure out that path so I can pass it to my other functions, which read and load the XML file once it has been extracted as the only file from the tar archive?

There are 3 answers

pbuck On BEST ANSWER

Get the full path by iterating over the result of getnames(). For example, to get the full path for lutBeta.xml (the trailing [0] takes the first match and raises IndexError if there is none):

import os
import tarfile

tar = tarfile.open('mytarfile.tar')
membername = [x for x in tar.getnames() if os.path.basename(x) == 'lutBeta.xml'][0]

Alex G Rice On

I would first try TarFile.getnames(), which I imagine works a lot like tar tzf filename.tar.gz on the command line. Then you can find out what path to feed to getmember().
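
As a rough sketch of that two-step approach: the snippet below builds a tiny in-memory archive so it can run on its own, lists the names, and then passes the full member path to getmember(). The archive contents, the demo_product directory name, and the helper names build_demo_tar and read_member_by_basename are made up for illustration; they are not from the question.

```python
import io
import os
import tarfile


def build_demo_tar():
    """Build a tiny in-memory tar with two nested XML files (illustrative data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in [
            ("demo_product/lutBeta.xml", b"<lut/>"),
            ("demo_product/product.xml", b"<product/>"),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def read_member_by_basename(tar, basename):
    """Return the bytes of the first member whose base name matches."""
    # getnames() returns full member paths, directory prefix included.
    matches = [n for n in tar.getnames() if os.path.basename(n) == basename]
    if not matches:
        raise KeyError(basename)
    # getmember() needs the full path, not just the base name.
    member = tar.getmember(matches[0])
    return tar.extractfile(member).read()


with tarfile.open(fileobj=build_demo_tar(), mode="r") as tar:
    print(tar.getnames())
    xml_bytes = read_member_by_basename(tar, "product.xml")
    print(xml_bytes)
```

Calling tar.getmember("product.xml") on the same archive raises KeyError, because the stored member name includes the directory prefix.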

Slim On

You don't want to iterate over the entire tar with getnames(), getmember() or getmembers(): once you have found your file, there is no need to keep reading the rest of the archive.

For example, on my machine it takes about 47 ms to extract a single file from a 2 GB tar by iterating over all the file names:

with tarfile.open('/tmp/2GB-file.tar', mode='r:') as tar:
    membername = [x for x in tar.getnames() if x.endswith('myfile.txt')][0]
    file = tar.extractfile(membername).read().decode()

But stopping as soon as the file is found takes only 0.27 ms on my machine, nearly 175x faster:

file = None
with tarfile.open('/tmp/2GB-file.tar', mode='r:') as tar:
    for member in tar:
        if member.name.endswith('myfile.txt'):
            file = tar.extractfile(member).read().decode()
            break

Note that if the file you need is nearer the end of the archive, you probably won't notice much of a change in speed, but it is still good practice not to loop through the whole file if you don't have to.
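
The early-exit loop could be wrapped in a small helper, sketched here under a few assumptions. The function name extract_first_match and the in-memory demo archive are hypothetical. It also compares os.path.basename() rather than using endswith(), since endswith('myfile.txt') would match a name like 'notmyfile.txt' as well.

```python
import io
import os
import tarfile


def extract_first_match(tar_source, basename):
    """Return the bytes of the first regular file named `basename`, or None."""
    with tarfile.open(fileobj=tar_source, mode="r:") as tar:
        # Iterating the TarFile reads member headers lazily, so returning
        # early avoids scanning the rest of the archive.
        for member in tar:
            if member.isfile() and os.path.basename(member.name) == basename:
                return tar.extractfile(member).read()
    return None


def make_archive():
    """Build a small uncompressed tar in memory (illustrative data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in [
            ("data/notmyfile.txt", b"decoy"),
            ("data/myfile.txt", b"hello"),
            ("data/later.txt", b"never inspected"),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


content = extract_first_match(make_archive(), "myfile.txt")
print(content.decode())
```

For a file on disk, the same loop works with tarfile.open(path, mode='r:') in place of the fileobj argument.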