It's well known that recompression (compressing an already compressed dataset) generally yields little or no additional compression.

Hence, I was very surprised to find a dataset where a second ZIP pass shrinks the result by roughly another 50% (tested via two runs of the Unix zip tool at the maximum compression level, 9).

Therefore, I'm curious: which characteristics (limits) of the Deflate algorithm can cause such behavior? I've tried other programs, and zstd, for example, yields much better compression on the first pass.

For reference, the dataset is here.

Mark Adler On

Deflate is limited to a match length of 258 bytes. If there are often repeated strings much longer than that, then a second compression may yield fruit. zstd codes match lengths up to 128K bytes.
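To see the effect in isolation, here is a minimal sketch using Python's standard `zlib` module (which implements Deflate). The input is synthetic and purely an assumption for illustration, not the dataset from the question: one 4096-byte pseudo-random block repeated many times, so every repeat is far longer than 258 bytes. The helper name `double_compress` is hypothetical.

```python
import random
import zlib


def double_compress(data: bytes, level: int = 9):
    """Return (once, twice): data deflated once, and that result deflated again."""
    once = zlib.compress(data, level)
    twice = zlib.compress(once, level)
    return once, twice


# Synthetic input (an assumption, not the asker's data): a 4096-byte
# pseudo-random block repeated 2000 times. The block itself is essentially
# incompressible, but each repetition is one long match.
rng = random.Random(0)
block = bytes(rng.randrange(256) for _ in range(4096))
data = block * 2000

once, twice = double_compress(data)

# The first pass has to split every long repeat into many matches of at most
# 258 bytes, so its output is itself a long run of near-identical codes.
# The second pass can then compress that run.
print("original:", len(data))
print("deflate once:", len(once))
print("deflate twice:", len(twice))

# Round-trip check: two decompressions recover the original data.
assert zlib.decompress(zlib.decompress(twice)) == data
```

On input like this, the second pass should come out much smaller than the first, because the first pass's output is mostly the same "length 258, same distance" code repeated over and over.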

gzip -9 once on your data gives 36340381 bytes. gzip -9 twice on your data gives 18860632 bytes. zstd -9 once gives 18985681 bytes. A second zstd -9 reduces it only slightly to 18882745 bytes.