Read Contents Tarfile into Python - "seeking backwards is not allowed"


I am new to Python and am having trouble reading the contents of a tarfile into Python.

The data are the contents of a journal article hosted at PubMed Central. The links below point to the article's record and to the tarfile I want to read into Python.

http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi?id=PMC13901 ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz

I have a list of similar .tar.gz files that I will eventually want to read in as well. As far as I know, every tarfile has a .nxml file associated with it, and it is the content of those .nxml files that I actually want to extract and read. I am open to any suggestions on the best way to do this.

Here is what I have if I save the tarfile to my PC. All runs as expected.

import tarfile

tarfile_name = "F:/PMC_OA_TextMining/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
tfile = tarfile.open(tarfile_name)

tfile_members = tfile.getmembers()

tfile_members1 = []
for i in range(len(tfile_members)):
    tfile_members_name = tfile_members[i].name
    tfile_members1.append(tfile_members_name)

tfile_members2 = []
for i in range(len(tfile_members1)):
    if tfile_members1[i].endswith('.nxml'):
        tfile_members2.append(tfile_members1[i])

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

I learned today that in order to access the tarfile directly from PubMed Central's FTP site, I have to set up a network request using urllib. Below is the revised code (and a link to the Stack Overflow answer I received):

Read contents of .tar.gz file from website into a python 3.x object

import tarfile
import urllib.request

tarfile_name = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(tarfile_name)
tfile = tarfile.open(fileobj=ftpstream, mode="r|gz")

However, when I run the remaining piece of the code (below) I get an error message ("seeking backwards is not allowed"). How come?

tfile_members = tfile.getmembers()

tfile_members1 = []
for i in range(len(tfile_members)):
    tfile_members_name = tfile_members[i].name
    tfile_members1.append(tfile_members_name)

tfile_members2 = []
for i in range(len(tfile_members1)):
    if tfile_members1[i].endswith('.nxml'):
        tfile_members2.append(tfile_members1[i])

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

The code fails on the last line, where I try to read the .nxml content associated with my tarfile. Below is the actual error message I receive. What does it mean? What is my best workaround for reading/accessing the content of these .nxml files which are all embedded in tarfiles?

Traceback (most recent call last):
  File "F:\PMC_OA_TextMining\test2.py", line 135, in <module>
    tfile_extract1_text = tfile_extract1.read()
  File "C:\Python30\lib\tarfile.py", line 804, in read
    buf += self.fileobj.read()
  File "C:\Python30\lib\tarfile.py", line 715, in read
    return self.readnormal(size)
  File "C:\Python30\lib\tarfile.py", line 722, in readnormal
    self.fileobj.seek(self.offset + self.position)
  File "C:\Python30\lib\tarfile.py", line 531, in seek
    raise StreamError("seeking backwards is not allowed")
tarfile.StreamError: seeking backwards is not allowed

Thanks in advance for your help. Chris

4

There are 4 answers

3
Damian Yerrick On BEST ANSWER

What's going wrong: Tar files are stored interleaved. They come in the order header, data, header, data, header, data, etc. When you enumerated the files with getmembers(), you read through the entire file to collect the headers. Then, when you asked the tarfile object to read the data, it tried to seek backward from the last header to the first data block. But you can't seek backward in a network stream without closing and reopening the urllib request.
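
The failure can be reproduced without a network connection. The sketch below is an illustration, not the asker's code: it builds a small gzip-compressed tar in memory (the member names and contents are made up for the example), opens it in stream mode "r|gz", and then repeats the getmembers() followed by extractfile().read() sequence from the question.

```python
import io
import tarfile


def build_sample_archive():
    """Return the bytes of a small .tar.gz with two made-up members."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in [("article.nxml", b"<article/>"),
                              ("figure.txt", b"placeholder")]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_after_getmembers(archive_bytes):
    """Enumerate all members first, then try to read the first one."""
    tar = tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r|gz")
    try:
        members = tar.getmembers()      # consumes the stream up to its end
        first = tar.extractfile(members[0])
        return first.read()             # the data now lies behind the stream position
    finally:
        tar.close()


try:
    read_after_getmembers(build_sample_archive())
    outcome = "read succeeded"
except tarfile.StreamError as exc:
    outcome = "StreamError: {}".format(exc)
print(outcome)
```

Even though the underlying BytesIO object is seekable, the "|" in the mode tells tarfile to treat it as a one-way stream, so the read should fail in the same way as it does over FTP.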

How to work around it: You'll need to download the file, save a temporary copy to disk or to an in-memory BytesIO buffer, enumerate the files in this temporary copy, and then extract the files you want.

#!/usr/bin/env python3
from io import BytesIO
import urllib.request
import tarfile

tarfile_url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(tarfile_url)

# BytesIO creates an in-memory temporary file.
# See the Python manual: http://docs.python.org/3/library/io.html
tmpfile = BytesIO()
while True:
    # Download a piece of the file from the connection
    s = ftpstream.read(16384)

    # Once the entire file has been downloaded, read() returns b''
    # (the empty bytes object), which is a falsy value
    if not s:  
        break

    # Otherwise, write the piece of the file to the temporary file.
    tmpfile.write(s)
ftpstream.close()

# Now that the FTP stream has been downloaded to the temporary file,
# we can ditch the FTP stream and have the tarfile module work with
# the temporary file.  Begin by seeking back to the beginning of the
# temporary file.
tmpfile.seek(0)

# Now tell the tarfile module that you're using a file object
# that supports seeking backward.
# r|gz forbids seeking backward; r:gz allows seeking backward
tfile = tarfile.open(fileobj=tmpfile, mode="r:gz")

# You want to limit it to the .nxml files
tfile_members2 = [filename
                  for filename in tfile.getnames()
                  if filename.endswith('.nxml')]

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

# And when you're done extracting members:
tfile.close()
tmpfile.close()
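
For archives too large to hold in memory, the "temporary copy to disk" option mentioned above can be sketched along the same lines. This is a minimal sketch and not part of the original answer: read_first_nxml is a hypothetical helper name, and it assumes only that stream is an object with a read() method, such as the object returned by urllib.request.urlopen.

```python
import shutil
import tarfile
import tempfile


def read_first_nxml(stream):
    """Spool a non-seekable .tar.gz stream to a temporary file on disk
    and return the bytes of its first .nxml member.

    Returns None if the archive contains no .nxml member.
    """
    with tempfile.TemporaryFile() as tmp:
        shutil.copyfileobj(stream, tmp)   # copies in chunks, like the loop above
        tmp.seek(0)                       # rewind before handing it to tarfile
        # "r:gz" requires a seekable file, which the temporary file is.
        with tarfile.open(fileobj=tmp, mode="r:gz") as tar:
            for member in tar.getmembers():
                if member.isfile() and member.name.endswith(".nxml"):
                    return tar.extractfile(member).read()
    return None
```

Calling read_first_nxml on a freshly opened FTP stream would take the place of the BytesIO download loop; the temporary file is deleted automatically when the with block exits.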
1
jmunsch On

I had the same error when trying to requests.get the file, so I extracted all to a tmp directory instead of using BytesIO, or extractfile(member):

import os
import tarfile
from lzma import LZMAFile

# stream == requests.get
inputs = [tarfile.open(fileobj=LZMAFile(stream), mode='r|')]
t = "/tmp"
for tarfileobj in inputs:        
    tarfileobj.extractall(path=t, members=None)
for fn in os.listdir(t):
    with open(os.path.join(t, fn)) as payload:
        print(payload.read())
0
Selali Adobor On

tl;dr: Remove getmembers to keep the stream

with tarfile.open(fileobj=response_body, mode="r|gz") as my_tar:
    for member in my_tar:  # move to the next file each loop
        current_file_contents = my_tar.extractfile(member)

This is an extremely old question, but the answer is drastically different than what most people who run into this would want.

If you have a streaming source, you almost certainly do want to have the pipe operator in your mode (like r|):

Pretend that the entire file is a timeline where X marks your current position, and * marks the first file you want to read

  • You open the tar and you're at the start: X---*------
  • You call getmembers. It needs to read the whole tar to tell you where all the files are, so now your position is at the end of the file: ---*------X
  • You try to go back to your file with extractfile ---*X-----... but going backwards is not allowed because streams are one-way (once you move past a chunk of the file, the last one gets thrown out)

Instead you can skip the call to getmembers and simply go one file at a time:

  • Open tar: X---*-----
  • Come across a file: ---X*-----
  • Read it with extractfile: ---*X----
  • Repeat until you reach the end: ---*------X

The difference in code is tiny. You take this:

with tarfile.open(fileobj=response_body, mode="r|gz") as my_tar:
    for member in my_tar.getmembers():  # getmembers moves through the entire tar file
        current_file_contents = my_tar.extractfile(member)

And simply remove the call to getmembers:

with tarfile.open(fileobj=response_body, mode="r|gz") as my_tar:
    for member in my_tar:  # move to the next file each loop
        current_file_contents = my_tar.extractfile(member)

All of the current answers require downloading the entire file, when instead you could deal with it in chunks and save a significant amount of resources depending on how large the objects you're dealing with are.

Unless you can't do anything at all without knowing every single filename in your archive, there's no reason to throw out the streaming response and wait for it to be written to disk.
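
Applied to the original question, the streaming approach could look like the sketch below. iter_nxml is a hypothetical helper, not something provided by tarfile. It assumes that fileobj is any object with a read() method, and it reads each matching member before the loop advances, because the stream cannot go back to it afterwards.

```python
import tarfile


def iter_nxml(fileobj):
    """Yield (name, content_bytes) for each .nxml member of a .tar.gz stream."""
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:              # headers are read one at a time
            if not member.isfile():
                continue                # extractfile() returns None for directories
            if member.name.endswith(".nxml"):
                # Read immediately: once the loop moves on, this data is gone.
                yield member.name, tar.extractfile(member).read()
```

Passing the object returned by urllib.request.urlopen straight to iter_nxml and looping over the result would process each article as it arrives, without a BytesIO buffer or temporary file.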

0
Oded BD On

A very easy solution is to change how tarfile opens the file. Instead of:

tfile = tarfile.open(tarfile_name)

change to:

with tarfile.open(fileobj=f, mode='r:*') as tar:

The important part is to put ':' in the mode. That opens the archive for random access, so f must be a seekable file object.

You can check this answer as well to read more about it.