ZipFile.testzip() returning different results on Python 2 and Python 3


Using the zipfile module to unzip a large data file in Python works correctly on Python 2 but produces the following error on Python 3.6.0:

BadZipFile: Bad CRC-32 for file 'myfile.csv'

I traced this to error handling code checking the CRC values.

On Python 2, ZipFile.testzip() returns None (all files are fine). On Python 3 it returns 'myfile.csv', indicating a problem with that file.

Code to reproduce on both Python 2 and Python 3 (involves a 300 MB download, sorry):

import zipfile
import urllib
import sys

url = "https://de.iplantcollaborative.org/anon-files//iplant/home/shared/commons_repo/curated/Vertnet_Amphibia_Sep2016/VertNet_Amphibia_Sept2016.zip"

if sys.version_info >= (3, 0, 0):
    # "import urllib" alone does not load the request submodule on Python 3
    import urllib.request
    urllib.request.urlretrieve(url, "vertnet_latest_amphibians.zip")
else:
    urllib.urlretrieve(url, "vertnet_latest_amphibians.zip")

archive = zipfile.ZipFile("vertnet_latest_amphibians.zip")
archive.testzip()

Does anyone understand why this difference exists and if there's a way to get Python 3 to properly extract the file using:

archive.extract("vertnet_latest_amphibians.csv")

There are 4 answers

Nick Matteo On BEST ANSWER

The CRC value is OK. The CRC of 'vertnet_latest_amphibians.csv' recorded in the zip is 0x87203305. After extraction, this is indeed the CRC of the file.
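
One way to repeat that check is to compute the CRC-32 of the extracted file in chunks and compare it with the value stored in the archive. This is only a sketch: the helper name crc32_of_file and the 1 MiB chunk size are my own choices, not part of zipfile.

```python
import zlib


def crc32_of_file(path, chunk_size=1 << 20):
    """Return the CRC-32 of a file, reading it in chunks to limit memory use."""
    crc = 0
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    # Mask to an unsigned 32-bit value (Python 2 can return a signed int)
    return crc & 0xFFFFFFFF
```

The result can then be compared with archive.getinfo("vertnet_latest_amphibians.csv").CRC, which holds the CRC recorded in the zip.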

However, the given uncompressed size is incorrect. The zip file records compressed size of 309,723,024 bytes, and uncompressed size of 292,198,614 bytes (that's smaller!). In reality, the uncompressed file is 4,587,165,910 bytes (4.3 GiB). This is bigger than the 4 GiB threshold where 32-bit counters break.
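
The recorded size is exactly what you get when the real size wraps around a 32-bit counter. A quick check of the arithmetic, using only the numbers quoted above:

```python
recorded_size = 292198614   # uncompressed size stored in the zip
actual_size = 4587165910    # real size of the extracted CSV

# A 32-bit field can only hold the size modulo 2**32
wrapped_size = actual_size % 2**32

print(wrapped_size == recorded_size)            # True
print(actual_size - recorded_size == 2**32)     # True
```

So the stored value is short by exactly 2**32 bytes.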

You can fix it like this (this worked in Python 3.5.2, at least):

archive = zipfile.ZipFile("vertnet_latest_amphibians.zip")
archive.getinfo("vertnet_latest_amphibians.csv").file_size += 2**32
archive.testzip() # now passes
archive.extract("vertnet_latest_amphibians.csv") # now works
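
The effect can be reproduced without the 300 MB download. The sketch below builds a tiny archive in memory and then under-reports the size on the ZipInfo object, on the assumption that Python 3's zipfile stops reading after file_size bytes and checks the CRC at that point. The member name data.txt and the payload are made up for the illustration.

```python
import io
import zipfile

# Build a small archive in memory (hypothetical member name and payload)
payload = b"".join(str(i).encode("ascii") for i in range(1000))
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as writer:
    writer.writestr("data.txt", payload)

archive = zipfile.ZipFile(buffer)
info = archive.getinfo("data.txt")
true_size = info.file_size

# Simulate a header that under-reports the uncompressed size
info.file_size = true_size // 2
broken = archive.testzip()   # name of the first member that fails

# Correct the size in memory only; the archive bytes are untouched
info.file_size = true_size
fixed = archive.testzip()    # None when every member passes

print(broken, fixed)
```

Note that changing file_size this way only affects the in-memory ZipInfo; the zip file on disk still carries the wrong size.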
Lawrence On

I was unable to get Python 3 to extract from the archive. Here are some results from an investigation (on Mac OS X) that might be helpful.

Check the health of the archive

Make the file read-only in order to prevent accidental changes:

$ chmod -w vertnet_latest_amphibians.zip 
$ ls -lh vertnet_latest_amphibians.zip 
-r--r--r-- 1 lawh 2045336417 296M Jan  6 10:10 vertnet_latest_amphibians.zip

Check the archive using zip and unzip:

$ zip -T vertnet_latest_amphibians.zip
test of vertnet_latest_amphibians.zip OK

$ unzip -t vertnet_latest_amphibians.zip
Archive:  vertnet_latest_amphibians.zip
    testing: VertNet_Amphibia_eml.xml   OK
    testing: __MACOSX/                OK
    testing: __MACOSX/._VertNet_Amphibia_eml.xml   OK
    testing: vertnet_latest_amphibians.csv   OK
    testing: __MACOSX/._vertnet_latest_amphibians.csv   OK
No errors detected in compressed data of vertnet_latest_amphibians.zip

As also found by @sam-mussmann, 7z reports a CRC error:

$ 7z t vertnet_latest_amphibians.zip 

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,4 CPUs x64)

Scanning the drive for archives:
1 file, 309726398 bytes (296 MiB)

Testing archive: vertnet_latest_amphibians.zip
--
Path = vertnet_latest_amphibians.zip
Type = zip
Physical Size = 309726398

ERROR: CRC Failed : vertnet_latest_amphibians.csv

Sub items Errors: 1

Archives with Errors: 1

Sub items Errors: 1

My zip and unzip are both rather old; 7z is pretty new:

$ zip -v | head -2
Copyright (c) 1990-2008 Info-ZIP - Type 'zip "-L"' for software license.
This is Zip 3.0 (July 5th 2008), by Info-ZIP.

$ unzip -v | head -1
UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.

$ 7z --help |head -3

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,4 CPUs x64)

Extract

Using unzip:

$ time unzip vertnet_latest_amphibians.zip vertnet_latest_amphibians.csv
Archive:  vertnet_latest_amphibians.zip
  inflating: vertnet_latest_amphibians.csv  

real    0m17.201s
user    0m14.281s
sys     0m2.460s

Extract using Python 2.7.13, using zipfile's command-line interface for brevity:

$ time ~/local/python-2.7.13/bin/python2 -m zipfile -e vertnet_latest_amphibians.zip .

real    0m19.491s
user    0m12.996s
sys     0m5.897s

As you found, Python 3.6.0 (also 3.4.5 and 3.5.2) reports a bad CRC.

Hypothesis 1: The archive contains a bad CRC that zip, unzip and Python 2.7.13 are failing to detect; 7z and Python 3.4-3.6 are all doing the right thing.

Hypothesis 2: The archive is fine; 7z and Python 3.4-3.6 all contain a bug.

Given the relative ages of these tools, I would guess that H1 is correct.

Workaround

If you are not using Windows and trust the contents of the archive, it might be more straightforward to use regular shell commands. Something like:

wget <the-long-url> -O /tmp/vertnet_latest_amphibians.zip
unzip /tmp/vertnet_latest_amphibians.zip vertnet_latest_amphibians.csv
rm -rf /tmp/vertnet_latest_amphibians.zip

Or you could execute unzip from within Python:

import os
os.system('unzip vertnet_latest_amphibians.zip vertnet_latest_amphibians.csv')
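
If you go this route, subprocess is a possible alternative to os.system, since it passes the arguments as a list (no shell quoting issues) and can raise on a non-zero exit status. A rough sketch for Python 3, with helper names that are my own:

```python
import shutil
import subprocess


def build_unzip_command(archive_path, member):
    """Return the argument list for extracting one member with unzip."""
    return ["unzip", archive_path, member]


def extract_with_unzip(archive_path, member):
    """Run unzip and raise CalledProcessError if it exits with an error."""
    if shutil.which("unzip") is None:
        raise RuntimeError("unzip was not found on PATH")
    subprocess.run(build_unzip_command(archive_path, member), check=True)
```

For example, extract_with_unzip("vertnet_latest_amphibians.zip", "vertnet_latest_amphibians.csv") runs the same command as the os.system call.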

Incidental

It is slightly neater to catch ImportError than to check the version of the Python interpreter:

try:
    from urllib.request import urlretrieve
except ImportError:
    from urllib import urlretrieve
henry On

As @Kundor showed, correcting file_size works. Setting it to the 32-bit maximum (2**32 - 1, i.e. 4 GiB minus 1 byte) would still fail for any file larger than that, so set it to the ZIP64 maximum instead (2**64 - 1, i.e. 16 EiB minus 1 byte).

Tested on the following archive (927 MB compressed, 11 GB for file_to_extract):

url: https://de.iplantcollaborative.org/anon-files//iplant/home/shared/commons_repo/curated/Vertnet_Aves_Sep2016/VertNet_Aves_Sept2016.zip

file: vertnet_latest_birds.csv

import zipfile
import urllib
import sys

url = "https://de.iplantcollaborative.org/anon-files//iplant/home/shared/commons_repo/curated/Vertnet_Amphibia_Sep2016/VertNet_Amphibia_Sept2016.zip"
zip_path = "vertnet_latest_amphibians.zip"
file_to_extract = "vertnet_latest_amphibians.csv"

if sys.version_info >= (3, 0, 0):
    # "import urllib" alone does not load the request submodule on Python 3
    import urllib.request
    urllib.request.urlretrieve(url, zip_path)
else:
    urllib.urlretrieve(url, zip_path)

archive = zipfile.ZipFile(zip_path)
if archive.testzip():
    # testzip() returned a file name, so some member failed its check;
    # set the recorded uncompressed size to the ZIP64 maximum
    archive.getinfo(file_to_extract).file_size = (2 ** 64) - 1
    
open_archive_file = archive.open(file_to_extract, 'r')
# or archive.extract(file_to_extract)
Robert Lujo On

In my case the issue was a wrong ZipInfo.file_size (Python 2.7) compared to the actual size of the extracted file (as @nick-matteo discovered). The cause of the mismatch was passing a unicode string to the ZipFile.writestr() function.

In my case the solution was to encode the unicode string to UTF-8 before passing it to writestr():

zf = zipfile.ZipFile(...)
if isinstance(file_contents, unicode):
    file_contents = file_contents.encode("utf8")
zf.writestr("filename.txt", file_contents)
...
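
To illustrate why the sizes can disagree: my assumption is that the recorded size came from the number of characters, while the bytes written were the encoded form. The sample string below is made up, and the snippet is written for Python 3 only to show the character count versus the byte count.

```python
# Hypothetical sample text with two non-ASCII characters
text = u"na\u00efve caf\u00e9"
encoded = text.encode("utf8")

char_count = len(text)      # number of characters
byte_count = len(encoded)   # number of bytes after UTF-8 encoding

print(char_count, byte_count)   # 10 12
```

Each of the two accented characters takes two bytes in UTF-8, so the byte count is larger than the character count.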