Is there a way to get the maximum possible size of a gzip/brotli encoded string?

167 views Asked by At

I need to send a string of comma delimited numbers through to an API. The numbers are in the range 0-99, they are sorted in ascending order (but this can be done on our end either when creating the string, or when we lookup the string later), each number is either present or omitted (no duplicates), and the string has to contain at least one number (up to all 100 within the range). To my knowledge, this is a combination without repetition, and because the string can contain a single number up to all 100, we do a summation where the number of selected elements is 1-100. The math behind the number of possible combinations (that makes up the string) doesn't really matter, but there is a significant number of possible combinations.

If I create a string of all 100 numbers in the range of 0-99 like so:

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99

The resultant string has a length of 289 characters - this is the maximum length of the string (in plaintext). This is a problem because the API only supports a maximum size of 255 characters for this field. I can compress of the string to shorten it (the format that we send to the API doesn't really matter as long as we can recover the plaintext later), but it has to be present.

If I gzip the aforementioned string, I get:

H4sIAAAAAAAACg2QSQEAMAzCDOVRete/sU0AkGAIJ0iKZlgOGRJyFChRoUaDFh1u+M84HnjihTc++OJHGCHiVwaRRBFNDLHEkUaKdPIvJllkk0MueZRRopwK6gMV1dRQSx1ttGing0768zY99NLHGCPGmWCSKebrDLPMscaKdTbYZItt9tsue5xx4pwLLrnimhvun3EPU7LMNyEBAAA=

The gzip-compressed string is 212 characters - which fits within the 255 character limit.

If I switch from gzip compression to Brotli, I get:

GyAB6I2ULmco0rFhLAZ3gWApSZCLltuPx6fnl9c3b989fACEYATFYHH8BEnRDMvh8vQFUZIVVaPV+RumZTuux+vLD8IoTtJMNtdflFXdtJ1ub/5gOBpPpjOzc/sXy9V6s93Z3bt/OJ7Ol+vNuQM=

The Brotli-compressed string is 148 characters long - which is a significant saving compared to both plaintext and gzip.

I know that lossless compression algorithms (like gzip and Brotli) work based on their ability to remove redundant data. My concern is that since the plaintext string is a combination of the numbers 0-99, and there are a lot of possible combinations, that there could be combinations (with little redundancy) that compress to more than 255 characters. I'm also not sure whether there are any cases where the compressed string would exceed the length of the the two samples above (where the maximum length plaintext is reduced to 212 chars with gzip/148 chars with Brotli).

Does anyone know if there is a way for me to find the upper bound for the compressed string length given all of the possible permutations?

0

There are 0 answers