It's well known that recompression (compressing an already-compressed dataset) generally yields little or no further size reduction.
Hence, I was very surprised to find a dataset where a second ZIP compression yields a compression factor of roughly 50% (tested via two runs of the Unix zip tool at maximum compression level, -9).
Therefore, I'm curious: which characteristics (limits) of the Deflate algorithm can cause such behavior? I've tried other programs, and algorithms such as zstd, for example, yield much better compression on the first pass.
For reference, the dataset is here.
Deflate is limited to a match length of 258 bytes. If there are often repeated strings much longer than that, the first pass has to chop each long repeat into a run of maximum-length matches, leaving a compressed stream that is itself repetitive, so a second compression may yield fruit. zstd codes match lengths up to 128K bytes, so it can capture such long repeats in a single pass.
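The effect is easy to reproduce with synthetic data. The sketch below (an assumption standing in for the actual dataset: one short block repeated many times, each repeat well within deflate's window but far longer than 258 bytes) deflates the data twice with Python's zlib; on data shaped like this, the second pass shrinks the stream substantially.

```python
import random
import zlib

# Synthetic stand-in (NOT the asker's dataset): a 1000-byte pseudo-random
# block repeated 20000 times. Every repeat lies within deflate's 32 KB
# window, but each copy must be encoded as several matches of at most
# 258 bytes, so the first pass emits the same short token sequence over
# and over -- and that regularity is what a second pass can exploit.
random.seed(0)
block = bytes(random.getrandbits(8) for _ in range(1000))
data = block * 20000  # ~20 MB

once = zlib.compress(data, 9)   # first deflate pass
twice = zlib.compress(once, 9)  # recompress the compressed stream

print(len(data), len(once), len(twice))
```

With zstd-style long matches the first pass would already capture each full repeat, which is why a second zstd pass gains almost nothing.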
gzip -9 once on your data gives 36340381 bytes. gzip -9 twice on your data gives 18860632 bytes. zstd -9 once gives 18985681 bytes. A second zstd -9 pass reduces it only slightly, to 18882745 bytes.