I am new to Python. I am having trouble reading the contents of a tarfile into Python.
The data are the contents of a journal article hosted at PubMed Central; see the info below and the link to the tarfile I want to read into Python.
http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi?id=PMC13901
ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz
I have a list of similar .tar.gz files I will eventually want to read in as well. I know all of the tarfiles have a .nxml file associated with them, and it is the content of the .nxml files I am actually interested in extracting/reading. I am open to any suggestions on the best way to do this.
Here is what I have if I save the tarfile to my PC. All runs as expected.
import tarfile

tarfile_name = "F:/PMC_OA_TextMining/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
tfile = tarfile.open(tarfile_name)

# Collect the names of all members in the archive.
tfile_members = tfile.getmembers()
tfile_members1 = []
for member in tfile_members:
    tfile_members1.append(member.name)

# Keep only the .nxml members.
tfile_members2 = []
for name in tfile_members1:
    if name.endswith('.nxml'):
        tfile_members2.append(name)

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()
I learned today that in order to access the tarfile directly from PubMed Central's FTP site, I have to set up a network request using urllib. Below is the revised code (and a link to the Stack Overflow answer I received):
Read contents of .tar.gz file from website into a python 3.x object
import tarfile
import urllib.request

tarfile_name = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(tarfile_name)
tfile = tarfile.open(fileobj=ftpstream, mode="r|gz")
However, when I run the remaining piece of the code (below) I get an error message ("seeking backwards is not allowed"). How come?
# Collect the names of all members in the archive.
tfile_members = tfile.getmembers()
tfile_members1 = []
for member in tfile_members:
    tfile_members1.append(member.name)

# Keep only the .nxml members.
tfile_members2 = []
for name in tfile_members1:
    if name.endswith('.nxml'):
        tfile_members2.append(name)

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()
The code fails on the last line, where I try to read the .nxml content associated with my tarfile. Below is the actual error message I receive. What does it mean? What is my best workaround for reading/accessing the content of these .nxml files which are all embedded in tarfiles?
Traceback (most recent call last):
File "F:\PMC_OA_TextMining\test2.py", line 135, in <module>
tfile_extract1_text = tfile_extract1.read()
File "C:\Python30\lib\tarfile.py", line 804, in read
buf += self.fileobj.read()
File "C:\Python30\lib\tarfile.py", line 715, in read
return self.readnormal(size)
File "C:\Python30\lib\tarfile.py", line 722, in readnormal
self.fileobj.seek(self.offset + self.position)
File "C:\Python30\lib\tarfile.py", line 531, in seek
raise StreamError("seeking backwards is not allowed")
tarfile.StreamError: seeking backwards is not allowed
Thanks in advance for your help. Chris
What's going wrong: Tar files are stored interleaved. They come in the order header, data, header, data, and so on. When you enumerated the files with getmembers(), you had already read through the entire stream to collect the headers. Then, when you asked the tarfile object to read the data, it tried to seek backward from the last header to the first data block. But you can't seek backward in a network stream without closing and reopening the urllib request.

How to work around it: You'll need to download the file, save a temporary copy to disk or to an in-memory buffer (io.BytesIO in Python 3, since the archive is binary data), enumerate the files in this temporary copy, and then extract the files you want.