I'm trying to figure out if there's a way to calculate a minimum required size for an output buffer, based on the size of the input buffer.
This question is similar to "zlib, deflate: How much memory to allocate?", but not the same. I am asking about each chunk in isolation, rather than the entire stream.
So suppose we have two buffers, INPUT and OUTPUT, and a BUFFER_SIZE of, say, 4096 bytes. (Just a convenient number; no particular reason I chose this size.)
If I deflate using:
deflate(stream, Z_PARTIAL_FLUSH)
so that each chunk is compressed, and immediately flushed to the output buffer, is there a way I can guarantee I'll have enough storage in the output buffer without needing to reallocate?
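Roughly, the loop I have in mind looks like this (a sketch; read_chunk() and write_chunk() are placeholders for however the data actually arrives and leaves, and error handling is trimmed):

    #include <zlib.h>

    #define BUFFER_SIZE 4096

    /* Placeholders for however the data actually arrives and leaves. */
    extern int  read_chunk(unsigned char *buf, int max);
    extern void write_chunk(const unsigned char *buf, size_t len);

    /* Compress each chunk with Z_PARTIAL_FLUSH so the output is
       flushed and usable immediately. strm is assumed to be already
       set up via deflateInit(). */
    int compress_chunks(z_stream *strm)
    {
        unsigned char input[BUFFER_SIZE];
        unsigned char output[BUFFER_SIZE];   /* the buffer I'm asking about */
        int len;

        while ((len = read_chunk(input, BUFFER_SIZE)) > 0) {
            strm->next_in   = input;
            strm->avail_in  = (uInt)len;
            strm->next_out  = output;
            strm->avail_out = BUFFER_SIZE;

            if (deflate(strm, Z_PARTIAL_FLUSH) != Z_OK)
                return -1;

            /* If avail_out ever hits 0, the chunk didn't fit and I'd
               have to call deflate() again into more space -- exactly
               the reallocation I'm trying to avoid. */
            write_chunk(output, BUFFER_SIZE - strm->avail_out);
        }
        return 0;
    }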
Superficially, we'd assume that the DEFLATED data will always be SMALLER than the uncompressed input data (assuming we use a compression level greater than 0).
Of course, that's not always the case - especially for small inputs. For example, if we deflate a single byte, the deflated data will obviously be larger than the uncompressed data, due to the overhead of things like the zlib header and the block framing in the DEFLATE stream.
Thinking about how DEFLATE works, it would seem that if our input data is at least 256 bytes or so (meaning that in the worst-case scenario every single byte is different and we can't really compress anything), we should realize that an input SMALLER than that, plus the zlib headers, could potentially require a LARGER output buffer.
But, generally, real-world applications aren't going to be compressing chunks that small. So assuming input/output buffers of something more like 4K, is there some way to GUARANTEE that the compressed output will be SMALLER than the input data?
(Also, I know about deflateBound, but would rather avoid it because of the overhead.)
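For reference, what I'd be avoiding is sizing the output per chunk with something along these lines (a sketch; strm is assumed to be an already-initialized deflate stream):

    #include <stdlib.h>
    #include <zlib.h>

    /* deflateBound() returns an upper bound on deflate's output for the
       given input length, so a buffer this size can never be too small --
       but it means per-chunk sizing and allocation, which is the overhead
       I'd rather skip. */
    unsigned char *alloc_chunk_output(z_stream *strm, uLong chunk_len, uLong *cap)
    {
        *cap = deflateBound(strm, chunk_len);
        return malloc(*cap);
    }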
Or, to put it another way, is there some minimum buffer size that I can use for input/output buffers that will guarantee that the output data (the compressed stream) will be smaller than the input data? Or is there always some pathological case that can cause the output stream to be larger than the input stream, regardless of size?
Information theory dictates that there must always be pathological cases which "compress" to something larger: a lossless encoder that made every input smaller would map 2^n possible inputs onto fewer than 2^n outputs, which the pigeonhole principle rules out.
This page starts off with the worst-case encoding sizes for zlib - it looks like the worst-case growth is 6 bytes, plus 5 bytes per started 16 KB block. So if you always flush after less than 16 KB of input, buffers that are 11 bytes larger than your flush interval should be safe.
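To put that arithmetic in code (a sketch; the constants below are the worst-case figures quoted above, not values read out of zlib's headers):

    #include <stddef.h>

    /* Worst-case output size for one flushed chunk of input_len bytes:
       5 bytes of block overhead per started 16 KB block, plus 6 bytes
       of fixed overhead. input_len is the amount fed between flushes,
       not the size of the whole stream. */
    size_t chunk_bound(size_t input_len)
    {
        size_t started_blocks = (input_len + 16383) / 16384;
        return input_len + 5 * started_blocks + 6;
    }

    /* For any flush interval under 16 KB this works out to
       input_len + 11, i.e. "11 bytes plus your flush interval". */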
Unless you have tight control over the type of data you're compressing, finding pathological cases isn't hard. Any random number generator will find you some pretty quickly.
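If you want to see one, something like this (a sketch using the one-shot compress2() convenience call rather than the streaming API) will expand pseudo-random input essentially every time:

    #include <stdio.h>
    #include <stdlib.h>
    #include <zlib.h>

    int main(void)
    {
        unsigned char input[4096], output[8192];
        uLongf out_len = sizeof(output);

        /* Fill the input with pseudo-random bytes -- effectively incompressible. */
        for (size_t i = 0; i < sizeof(input); i++)
            input[i] = (unsigned char)rand();

        if (compress2(output, &out_len, input, sizeof(input), Z_BEST_COMPRESSION) != Z_OK)
            return 1;

        /* Expect out_len to come out slightly larger than 4096. */
        printf("in: %zu  out: %lu\n", sizeof(input), out_len);
        return 0;
    }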