Read contents of .tar.gz file from website into a python 3.x object

6.2k views Asked by At

I am new to python. I can't figure out what I am doing wrong when trying to read the contents of .tar.gz file into python. The tarfile I would like to read is hosted at the following web address:

ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz

more info on file at this site (just so you can trust contents) http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi?id=PMC13901

The tarfile contains .pdf and .nxml copies of the journal article. And also a couple of image files.

If I open the file in my browser by copying and pasting. I can save to a location on my PC and import the tarfile fine using the following commands (note: winzip changes the file from .tar.gz to simply .tar when I save to location):

import tarfile
thetarfile = "C:/Users/dfcm/Documents/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar"
tfile = tarfile.open(thetarfile)
tfile

However, if I try to access the file directly using similar commands:

thetarfile = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
bbb = tarfile.open(thetarfile)

That results in the following error:

 Traceback (most recent call last):
 File "<pyshell#137>", line 1, in <module>
 bbb = tarfile.open(thetarfile)
 File "C:\Python30\lib\tarfile.py", line 1625, in open
 return func(name, "r", fileobj, **kwargs)
 File "C:\Python30\lib\tarfile.py", line 1687, in gzopen
 fileobj = bltn_open(name, mode + "b")
 File "C:\Python30\lib\io.py", line 278, in __new__
 return open(*args, **kwargs)
 File "C:\Python30\lib\io.py", line 222, in open
 closefd)
 File "C:\Python30\lib\io.py", line 615, in __init__
 _fileio._FileIO.__init__(self, name, mode, closefd)
 IOError: [Errno 22] Invalid     argument: 'ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar'

Can anyone explain what I am doing wrong when trying to read the .tar.gz file directly from the web address? Thanks in advance. Chris

1

There are 1 answers

0
Helmut Grohne On BEST ANSWER

Unfortunately you cannot just open files from the network. Things are a bit more complex here. You have to instruct the interpreter to create a network request and create an object representing the request state. This can be done using the urllib module.

import urllib.request
import tarfile
thetarfile = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(thetarfile)
thetarfile = tarfile.open(fileobj=ftpstream, mode="r|gz")

The ftpstream object is a file-like that represents the connection to the ftp server. Then the tarfile module can access this stream. Since we do not pass the filename, we have to specify the compression in the mode parameter.