zipfile can't handle some types of zip data?


I came across this problem while trying to decompress a zip file.

-- zipfile.is_zipfile(my_file) always returns False, even though the UNIX unzip command handles the file just fine. Calling zipfile.ZipFile() on the path, or on a file handle opened on it, fails in the same way.

-- The file command reports "Zip archive data, at least v2.0 to extract", and running less on the file shows:

PKZIP for iSeries by PKWARE
 Length      Method Size      Cmpr Date       Time  CRC-32    Name
 2113482674  Defl:S 204502989  90% 2010-11-01 08:39 2cee662e  myfile.txt
 2113482674         204502989  90%                            1 file

Any ideas on how I can work around this issue? It would be nice if I could make Python's zipfile work, since I already have some unit tests that I would have to drop if I switched to running subprocess.call("unzip").

3

There are 3 answers

0
Rockallite On
# Utilize mmap module to avoid a potential DoS exploit (e.g. by reading the
# whole zip file into memory). A bad zip file example can be found here:
# https://bugs.python.org/issue24621

import mmap
from io import UnsupportedOperation
from zipfile import BadZipfile

# The end of central directory signature
CENTRAL_DIRECTORY_SIGNATURE = b'\x50\x4b\x05\x06'


def repair_central_directory(zipFile):
    if hasattr(zipFile, 'read'):
        # This is a file-like object
        f = zipFile
        try:
            fileno = f.fileno()
        except UnsupportedOperation:
            # This is an io.BytesIO instance which lacks a backing file.
            fileno = None
    else:
        # Otherwise, open the file in binary read/write mode
        f = open(zipFile, 'rb+')
        fileno = f.fileno()
    if fileno is None:
        # Without a fileno, we can only read and search the whole string
        # for the end of central directory signature.
        f.seek(0)
        pos = f.read().find(CENTRAL_DIRECTORY_SIGNATURE)
    else:
        # Instead of reading the entire file into memory, memory-map the
        # file, then search it for the end of central directory signature.
        # Reference: https://stackoverflow.com/a/21844624/2293304
        mm = mmap.mmap(fileno, 0)
        pos = mm.find(CENTRAL_DIRECTORY_SIGNATURE)
        mm.close()
    if pos > -1:
        # size of 'ZIP end of central directory record'
        f.truncate(pos + 22)
        f.seek(0)
        return f
    else:
        # Raise an error to make it fail fast
        raise BadZipfile('File is not a zip file')
0
Tom Zych On

You say that using less on the file shows such and such. Do you mean this?

less my_file

If so, I would guess these are comments that the zip program put in the file. Looking at a user guide for the iSeries PKZIP I found on the web, this appears to be the default behavior.

The docs for zipfile say "This module does not currently handle ZIP files which have appended comments." Perhaps this is the problem? (Of course, if less shows them, this would seem to imply that they're prepended, FWIW.)

It appears you (or whoever created the zipfile on an iSeries machine) can turn this off with ARCHTEXT(*NONE), or use ARCHTEXT(*CLEAR) to remove it from an existing zipfile.
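
To check whether a given archive really carries a comment or extra bytes after its end of central directory record, something like the following might help. It is only a sketch: the function name inspect_eocd is made up for illustration, it assumes the whole file fits in memory, and it assumes the last occurrence of the signature is the real record.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"   # same bytes as \x50\x4b\x05\x06
EOCD_FORMAT = "<4s4H2LH"         # fixed part of the end of central directory record
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)  # 22 bytes


def inspect_eocd(data):
    """Return (offset, comment_length, trailing_bytes), or None if no record is found.

    `data` is the full content of the archive as bytes.
    """
    # Search from the end, because the record sits near the end of the file.
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD_SIZE > len(data):
        return None
    fields = struct.unpack(EOCD_FORMAT, data[pos:pos + EOCD_SIZE])
    comment_length = fields[-1]  # last field: length of the archive comment
    # Bytes left over after the record and its declared comment.
    trailing = len(data) - (pos + EOCD_SIZE + comment_length)
    return pos, comment_length, trailing
```

A non-zero comment_length points to an archive comment, while a positive trailing value means there is data after the record that the record itself does not account for.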

2
Uri Cohen On

I ran into the same issue with my files and was able to solve it. As with the example above, I'm not sure how they were generated. They all had trailing data at the end, which both Windows and 7z ignored but which made Python's zipfile fail.

This is the code to solve the issue:

import zipfile

def fixBadZipfile(zipFile):
    # Open in binary read/write mode so the file can be truncated in place
    with open(zipFile, 'r+b') as f:
        data = f.read()
        pos = data.find(b'\x50\x4b\x05\x06')  # End of central directory signature
        if pos > 0:
            print("Truncating file at location " + str(pos + 22) + ".")
            f.seek(pos + 22)  # size of 'ZIP end of central directory record'
            f.truncate()
        else:
            # No signature found: the file is truncated or not a zip file
            raise zipfile.BadZipFile("File is truncated or not a zip file")
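
If modifying the file on disk is not acceptable (in unit tests, for example), a variant that works on an in-memory copy could look like this. It is a sketch rather than a tested fix for iSeries archives: the function name open_zip_ignoring_trailing_data is hypothetical, it searches for the last occurrence of the signature instead of the first, and it keeps the archive comment by reading the comment length stored in the record.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"\x50\x4b\x05\x06"  # end of central directory signature
EOCD_SIZE = 22                        # fixed size of the record, without comment


def open_zip_ignoring_trailing_data(data):
    """Return a ZipFile over a trimmed in-memory copy of `data` (bytes)."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD_SIZE > len(data):
        raise zipfile.BadZipFile("File is not a zip file")
    # The last two bytes of the record hold the comment length (little-endian).
    (comment_length,) = struct.unpack("<H", data[pos + 20:pos + EOCD_SIZE])
    end = pos + EOCD_SIZE + comment_length
    # Everything past `end` is dropped; the original bytes are left untouched.
    return zipfile.ZipFile(io.BytesIO(data[:end]))
```

Usage would be along the lines of open_zip_ignoring_trailing_data(open(path, 'rb').read()). Note that this holds the whole archive in memory, so for very large files the truncate-in-place or mmap approaches above are likely the better fit.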